WeRateDogs Twitter data wrangling

Devin McCormack

Gathering Data for this Project

Gather each of the three pieces of data as described below in a Jupyter Notebook titled wrangle_act.ipynb:

  1. The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv

  2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

  3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.

Gather

Import the manually downloaded WeRateDogs Twitter archive from 'twitter-archive-enhanced.csv'

In [1]:
import pandas as pd
import requests
import os
In [2]:
df_dog=pd.read_csv('twitter-archive-enhanced.csv')
In [3]:
df_dog.head()
Out[3]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" r... This is Phineas. He's a mystical boy. Only eve... NaN NaN NaN https://twitter.com/dog_rates/status/892420643... 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" r... This is Tilly. She's just checking pup on you.... NaN NaN NaN https://twitter.com/dog_rates/status/892177421... 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Archie. He is a rare Norwegian Pouncin... NaN NaN NaN https://twitter.com/dog_rates/status/891815181... 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Darla. She commenced a snooze mid meal... NaN NaN NaN https://twitter.com/dog_rates/status/891689557... 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" r... This is Franklin. He would like you to stop ca... NaN NaN NaN https://twitter.com/dog_rates/status/891327558... 12 10 Franklin None None None None

Programmatically request the image predictions from Udacity's servers

In [4]:
url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response=requests.get(url)

with open(url.split('/')[-1],mode='wb') as file:
    file.write(response.content)

df_breed=pd.read_csv('image-predictions.tsv',sep='\t')

df_breed.head()
Out[4]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
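
The download cell above writes whatever the server returns, including an error page if the request fails. A minimal sketch of a more defensive variant follows. It assumes a hypothetical helper named save_response (not part of the original notebook) and relies only on the raise_for_status() method and content attribute that requests.Response provides.

```python
from pathlib import Path


def save_response(response, url, dest_dir="."):
    """Save a downloaded body under the last path segment of its URL.

    `response` is expected to behave like requests.Response.
    save_response is a hypothetical helper, not part of the notebook.
    """
    # Raises an error for 4xx/5xx responses, so an error page is never
    # written to disk as if it were the data file.
    response.raise_for_status()
    dest = Path(dest_dir) / url.split("/")[-1]
    dest.write_bytes(response.content)
    return dest
```

It could be called as save_response(requests.get(url), url) in place of the with open(...) block.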

Use the Twitter API to access each tweet's JSON data and read it into a DataFrame

In [5]:
import tweepy

accesskey=pd.read_csv('twittertoken.csv')

consumer_key = accesskey.consumer_key[0]
consumer_secret = accesskey.consumer_secret[0]
access_token = accesskey.access_token[0]
access_secret = accesskey.access_secret[0]

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth_handler=auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
In [6]:
import json

# Query the API only when tweet_json.txt is missing, so that re-running the
# notebook does not repeat the rate-limited download.
# To force a fresh download, delete tweet_json.txt first.
if os.path.isfile('tweet_json.txt'):
    print('file already exists')
else:
    count=0
    for tweet_id in df_dog.tweet_id:
        count+=1
        try:
            tweet=api.get_status(tweet_id,tweet_mode='extended')
            line=json.dumps(tweet._json)
            print(count,tweet._json.get('id_str'))
        except Exception:
            # The tweet could not be fetched (for example, it was deleted);
            # keep an empty line as a placeholder. Unlike a bare except, this
            # still lets KeyboardInterrupt stop the loop.
            line=''
            print(count,'TWEET NOT FOUND!')
        # One JSON object per line.
        with open('tweet_json.txt',mode='a',encoding='utf-8') as file:
            file.write(line+'\n')
file already exists
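
The loop above stores one JSON object per line and leaves an empty line for each tweet that could not be fetched. The round trip can be sketched in isolation with the standard library alone. The two records and the helper names below are made up for illustration and are not real API output.

```python
import json
import os
import tempfile


def write_json_lines(path, records):
    """Write one JSON object per line; None becomes an empty placeholder line."""
    with open(path, mode="w", encoding="utf-8") as file:
        for record in records:
            if record is not None:
                json.dump(record, file)
            file.write("\n")


def read_json_lines(path):
    """Return the parsed objects, skipping empty placeholder lines."""
    with open(path, mode="r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


# Illustrative records only; None stands in for a tweet that was not found.
sample = [{"id_str": "1", "favorite_count": 5}, None, {"id_str": "2", "favorite_count": 7}]
path = os.path.join(tempfile.mkdtemp(), "sample_json.txt")
write_json_lines(path, sample)
parsed = read_json_lines(path)
print(len(parsed), parsed[0]["id_str"])
```

Skipping blank lines explicitly when reading keeps genuine parse errors visible instead of hiding them behind a catch-all handler.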

Pretty-print one tweet's JSON and compare it with the tweet data dictionary to find the important attributes

In [7]:
with open('tweet_json.txt',mode='r') as file:
    tweet=json.dumps(json.loads(file.readline()),indent=4)
print(tweet)   
{
    "created_at": "Tue Aug 01 16:23:56 +0000 2017",
    "id": 892420643555336193,
    "id_str": "892420643555336193",
    "full_text": "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
    "truncated": false,
    "display_text_range": [
        0,
        85
    ],
    "entities": {
        "hashtags": [],
        "symbols": [],
        "user_mentions": [],
        "urls": [],
        "media": [
            {
                "id": 892420639486877696,
                "id_str": "892420639486877696",
                "indices": [
                    86,
                    109
                ],
                "media_url": "http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg",
                "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg",
                "url": "https://t.co/MgUWQ76dJU",
                "display_url": "pic.twitter.com/MgUWQ76dJU",
                "expanded_url": "https://twitter.com/dog_rates/status/892420643555336193/photo/1",
                "type": "photo",
                "sizes": {
                    "thumb": {
                        "w": 150,
                        "h": 150,
                        "resize": "crop"
                    },
                    "medium": {
                        "w": 540,
                        "h": 528,
                        "resize": "fit"
                    },
                    "small": {
                        "w": 540,
                        "h": 528,
                        "resize": "fit"
                    },
                    "large": {
                        "w": 540,
                        "h": 528,
                        "resize": "fit"
                    }
                }
            }
        ]
    },
    "extended_entities": {
        "media": [
            {
                "id": 892420639486877696,
                "id_str": "892420639486877696",
                "indices": [
                    86,
                    109
                ],
                "media_url": "http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg",
                "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg",
                "url": "https://t.co/MgUWQ76dJU",
                "display_url": "pic.twitter.com/MgUWQ76dJU",
                "expanded_url": "https://twitter.com/dog_rates/status/892420643555336193/photo/1",
                "type": "photo",
                "sizes": {
                    "thumb": {
                        "w": 150,
                        "h": 150,
                        "resize": "crop"
                    },
                    "medium": {
                        "w": 540,
                        "h": 528,
                        "resize": "fit"
                    },
                    "small": {
                        "w": 540,
                        "h": 528,
                        "resize": "fit"
                    },
                    "large": {
                        "w": 540,
                        "h": 528,
                        "resize": "fit"
                    }
                }
            }
        ]
    },
    "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 4196983835,
        "id_str": "4196983835",
        "name": "WeRateDogs\u2122",
        "screen_name": "dog_rates",
        "location": "\ud835\udcf6\ud835\udcee\ud835\udcfb\ud835\udcec\ud835\udcf1 \u21b4      DM YOUR DOGS",
        "description": "Your Only Source for Professional Dog Ratings STORE: @ShopWeRateDogs | IG, FB & SC: WeRateDogs | MOBILE APP: @GoodDogsGame Business: dogratingtwitter@gmail.com",
        "url": "https://t.co/N7sNNHAEXS",
        "entities": {
            "url": {
                "urls": [
                    {
                        "url": "https://t.co/N7sNNHAEXS",
                        "expanded_url": "http://weratedogs.com",
                        "display_url": "weratedogs.com",
                        "indices": [
                            0,
                            23
                        ]
                    }
                ]
            },
            "description": {
                "urls": []
            }
        },
        "protected": false,
        "followers_count": 5627901,
        "friends_count": 103,
        "listed_count": 4047,
        "created_at": "Sun Nov 15 21:41:29 +0000 2015",
        "favourites_count": 130658,
        "utc_offset": null,
        "time_zone": null,
        "geo_enabled": true,
        "verified": true,
        "statuses_count": 6628,
        "lang": "en",
        "contributors_enabled": false,
        "is_translator": false,
        "is_translation_enabled": false,
        "profile_background_color": "000000",
        "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
        "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
        "profile_background_tile": false,
        "profile_image_url": "http://pbs.twimg.com/profile_images/948761950363664385/Fpr2Oz35_normal.jpg",
        "profile_image_url_https": "https://pbs.twimg.com/profile_images/948761950363664385/Fpr2Oz35_normal.jpg",
        "profile_banner_url": "https://pbs.twimg.com/profile_banners/4196983835/1515037507",
        "profile_link_color": "F5ABB5",
        "profile_sidebar_border_color": "000000",
        "profile_sidebar_fill_color": "000000",
        "profile_text_color": "000000",
        "profile_use_background_image": false,
        "has_extended_profile": true,
        "default_profile": false,
        "default_profile_image": false,
        "following": false,
        "follow_request_sent": false,
        "notifications": false,
        "translator_type": "none"
    },
    "geo": null,
    "coordinates": null,
    "place": null,
    "contributors": null,
    "is_quote_status": false,
    "retweet_count": 8704,
    "favorite_count": 39140,
    "favorited": false,
    "retweeted": false,
    "possibly_sensitive": false,
    "possibly_sensitive_appealable": false,
    "lang": "en"
}

dict keys of important features:

'id_str','favorite_count', 'retweet_count'

Features that might be interesting:

'followers_count' under user - do we need to normalize favorites/retweets by the number of current followers?
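
One way to explore the normalization question is to divide each engagement count by the follower count. This is only a sketch, and per_follower is a hypothetical helper. A caveat: followers_count in the tweet JSON describes the account at the time of the query, so dividing by it rescales every tweet by the same constant rather than by the audience size when the tweet was posted.

```python
def per_follower(count, followers_count):
    """Return count divided by followers_count, or None when it is undefined."""
    if not followers_count:
        return None
    return count / followers_count


# Values taken from the first tweet's JSON shown above.
favorites_rate = per_follower(39140, 5627901)
retweets_rate = per_follower(8704, 5627901)
print(favorites_rate, retweets_rate)
```

Because the divisor is identical for every tweet fetched in one run, the ranking of tweets by favorites or retweets is unchanged by this normalization.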

In [8]:
with open('tweet_json.txt',mode='r') as file:
    tweet=json.loads(file.readline())
In [9]:
# Twitter recommends reading id_str so that the full ID is preserved;
# large integer IDs can lose precision if they are parsed as floats.
tweet.get('id_str')
Out[9]:
'892420643555336193'
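
The concern about numeric ID types can be made concrete. By default, pandas stores an integer column that contains missing values as float64, and a 64-bit float cannot represent every 18-digit integer exactly. The sketch below uses plain Python floats (also 64-bit) to show the effect on the tweet ID printed above.

```python
tweet_id_str = '892420643555336193'

as_int = int(tweet_id_str)      # Python ints are exact at any size
as_float = float(tweet_id_str)  # 64-bit float: 53 bits of precision

# Converting back from the float does not recover the original ID.
round_tripped = int(as_float)
print(as_int == round_tripped)
print(as_int - round_tripped)
```

Keeping IDs as strings avoids this silent rounding.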
In [10]:
tweet.get('favorite_count')
Out[10]:
39140
In [11]:
tweet.get('retweet_count')
Out[11]:
8704
In [12]:
tweet.get('user').get('followers_count')
Out[12]:
5627901

Build a DataFrame by looping over the lines of the tweet JSON text file

In [13]:
df_list=[]
with open('tweet_json.txt',mode='r') as file:
    for line in file:
        # Blank lines mark tweets that could not be retrieved; skip them.
        if not line.strip():
            continue
        tweet=json.loads(line)
        df_list.append({'tweet_id': tweet.get('id_str'),
                        'favorite_count': tweet.get('favorite_count'),
                        'retweet_count': tweet.get('retweet_count'),
                        'followers_count': tweet.get('user').get('followers_count')})

df_tweet = pd.DataFrame(df_list, columns = ['tweet_id', 'favorite_count', 'retweet_count','followers_count'])

df_tweet.head()
Out[13]:
tweet_id favorite_count retweet_count followers_count
0 892420643555336193 39140 8704 5627901
1 892177421306343426 33522 6388 5627901
2 891815181378084864 25264 4245 5627901
3 891689557279858688 42487 8801 5627901
4 891327558926688256 40664 9583 5627901

Assess

After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

In [14]:
df_dog.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [15]:
df_dog[df_dog.expanded_urls.isnull()][['tweet_id','in_reply_to_status_id','retweeted_status_id']]
Out[15]:
tweet_id in_reply_to_status_id retweeted_status_id
30 886267009285017600 8.862664e+17 NaN
55 881633300179243008 8.816070e+17 NaN
64 879674319642796034 8.795538e+17 NaN
113 870726314365509632 8.707262e+17 NaN
148 863427515083354112 8.634256e+17 NaN
179 857214891891077121 8.571567e+17 NaN
185 856330835276025856 NaN 8.563302e+17
186 856288084350160898 8.562860e+17 NaN
188 855862651834028034 8.558616e+17 NaN
189 855860136149123072 8.558585e+17 NaN
218 850333567704068097 8.503288e+17 NaN
228 848213670039564288 8.482121e+17 NaN
234 847617282490613760 8.476062e+17 NaN
274 840698636975636481 8.406983e+17 NaN
290 838150277551247360 8.381455e+17 NaN
291 838085839343206401 8.380855e+17 NaN
313 835246439529840640 8.352460e+17 NaN
342 832088576586297345 8.320875e+17 NaN
346 831926988323639298 8.319030e+17 NaN
375 828361771580813312 NaN NaN
387 826598799820865537 8.265984e+17 NaN
409 823333489516937216 8.233264e+17 NaN
427 821153421864615936 8.211526e+17 NaN
498 813130366689148928 8.131273e+17 NaN
513 811647686436880384 8.116272e+17 NaN
570 801854953262350336 8.018543e+17 NaN
576 800859414831898624 8.008580e+17 NaN
611 797165961484890113 7.971238e+17 NaN
701 786051337297522688 7.727430e+17 NaN
707 785515384317313025 NaN NaN
843 766714921925144576 7.667118e+17 NaN
857 763956972077010945 7.638652e+17 NaN
967 750381685133418496 7.501805e+17 NaN
1005 747651430853525504 7.476487e+17 NaN
1080 738891149612572673 7.384119e+17 NaN
1295 707983188426153984 7.079801e+17 NaN
1345 704491224099647488 7.044857e+17 NaN
1445 696518437233913856 NaN NaN
1446 696490539101908992 6.964887e+17 NaN
1474 693644216740769793 6.936422e+17 NaN
1479 693582294167244802 6.935722e+17 NaN
1497 692423280028966913 6.924173e+17 NaN
1523 690607260360429569 6.903413e+17 NaN
1598 686035780142297088 6.860340e+17 NaN
1605 685681090388975616 6.855479e+17 NaN
1618 684969860808454144 6.849598e+17 NaN
1663 682808988178739200 6.827884e+17 NaN
1689 681340665377193984 6.813394e+17 NaN
1774 678023323247357953 6.780211e+17 NaN
1819 676590572941893632 6.765883e+17 NaN
1844 675849018447167488 6.758457e+17 NaN
1895 674742531037511680 6.747400e+17 NaN
1905 674606911342424069 6.744689e+17 NaN
1914 674330906434379776 6.658147e+17 NaN
1940 673716320723169284 6.737159e+17 NaN
2038 671550332464455680 6.715449e+17 NaN
2149 669684865554620416 6.693544e+17 NaN
2189 668967877119254528 6.689207e+17 NaN
2298 667070482143944705 6.670655e+17 NaN
In [16]:
df_dog.describe()
Out[16]:
tweet_id in_reply_to_status_id in_reply_to_user_id retweeted_status_id retweeted_status_user_id rating_numerator rating_denominator
count 2.356000e+03 7.800000e+01 7.800000e+01 1.810000e+02 1.810000e+02 2356.000000 2356.000000
mean 7.427716e+17 7.455079e+17 2.014171e+16 7.720400e+17 1.241698e+16 13.126486 10.455433
std 6.856705e+16 7.582492e+16 1.252797e+17 6.236928e+16 9.599254e+16 45.876648 6.745237
min 6.660209e+17 6.658147e+17 1.185634e+07 6.661041e+17 7.832140e+05 0.000000 0.000000
25% 6.783989e+17 6.757419e+17 3.086374e+08 7.186315e+17 4.196984e+09 10.000000 10.000000
50% 7.196279e+17 7.038708e+17 4.196984e+09 7.804657e+17 4.196984e+09 11.000000 10.000000
75% 7.993373e+17 8.257804e+17 4.196984e+09 8.203146e+17 4.196984e+09 12.000000 10.000000
max 8.924206e+17 8.862664e+17 8.405479e+17 8.874740e+17 7.874618e+17 1776.000000 170.000000
In [17]:
df_dog.rating_denominator.value_counts()
Out[17]:
10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
In [18]:
df_dog.rating_numerator.value_counts()
Out[18]:
12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64
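
The unusual numerators and denominators above suggest checking the stored values against the tweet text. A minimal sketch follows, assuming ratings appear in the text as numerator/denominator (as in the 13/10 of the first tweet). The pattern and the helper name find_ratings are illustrative and not part of the original notebook.

```python
import re

# One or more digits, optionally with a decimal part, then "/" and digits.
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def find_ratings(text):
    """Return every numerator/denominator pair found in text as (float, int)."""
    return [(float(num), int(den)) for num, den in RATING_PATTERN.findall(text)]


text = ("This is Phineas. He's a mystical boy. Only ever appears in the "
        "hole of a donut. 13/10 https://t.co/MgUWQ76dJU")
print(find_ratings(text))
```

Tweets whose text contains more than one pair, or a pair that differs from the stored columns, are candidates for manual review.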
In [19]:
df_dog[df_dog.rating_numerator==2]
Out[19]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
1761 678675843183484930 NaN NaN 2015-12-20 20:38:24 +0000 <a href="http://twitter.com/download/iphone" r... Exotic pup here. Tail long af. Throat looks sw... NaN NaN NaN https://twitter.com/dog_rates/status/678675843... 2 10 None None None None None
1764 678424312106393600 NaN NaN 2015-12-20 03:58:55 +0000 <a href="http://twitter.com/download/iphone" r... This is Crystal. She's a shitty fireman. No se... NaN NaN NaN https://twitter.com/dog_rates/status/678424312... 2 10 Crystal None None None None
1920 674265582246694913 NaN NaN 2015-12-08 16:33:36 +0000 <a href="http://twitter.com/download/iphone" r... This is Henry. He's a shit dog. Short pointy e... NaN NaN NaN https://twitter.com/dog_rates/status/674265582... 2 10 Henry None None None None
2079 670826280409919488 NaN NaN 2015-11-29 04:47:03 +0000 <a href="http://twitter.com/download/iphone" r... Scary dog here. Too many legs. Extra tail. Not... NaN NaN NaN https://twitter.com/dog_rates/status/670826280... 2 10 None None None None None
2237 668142349051129856 NaN NaN 2015-11-21 19:02:04 +0000 <a href="http://twitter.com/download/iphone" r... This lil pup is Oliver. Hops around. Has wings... NaN NaN NaN https://twitter.com/dog_rates/status/668142349... 2 10 None None None None None
2246 667878741721415682 NaN NaN 2015-11-21 01:34:35 +0000 <a href="http://twitter.com/download/iphone" r... This is Tedrick. He lives on the edge. Needs s... NaN NaN NaN https://twitter.com/dog_rates/status/667878741... 2 10 Tedrick None None None None
2310 666786068205871104 NaN NaN 2015-11-18 01:12:41 +0000 <a href="http://twitter.com/download/iphone" r... Unfamiliar with this breed. Ears pointy af. Wo... NaN NaN NaN https://twitter.com/dog_rates/status/666786068... 2 10 None None None None None
2326 666411507551481857 NaN NaN 2015-11-17 00:24:19 +0000 <a href="http://twitter.com/download/iphone" r... This is quite the dog. Gets really excited whe... NaN NaN NaN https://twitter.com/dog_rates/status/666411507... 2 10 quite None None None None
2349 666051853826850816 NaN NaN 2015-11-16 00:35:11 +0000 <a href="http://twitter.com/download/iphone" r... This is an odd dog. Hard on the outside but lo... NaN NaN NaN https://twitter.com/dog_rates/status/666051853... 2 10 an None None None None
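
The name column in the rows above includes values such as quite and an, which are ordinary words rather than names, alongside the string None. A rough heuristic for flagging them is sketched below. It assumes real names start with an uppercase letter, which is an assumption of this sketch and not something the archive guarantees.

```python
def is_suspect_name(name):
    """Flag placeholder or lowercase values that are unlikely to be dog names."""
    return name == 'None' or not name[:1].isupper()


# Values taken from the name column in the output above.
names = ['None', 'Crystal', 'Henry', 'Tedrick', 'quite', 'an']
suspects = [name for name in names if is_suspect_name(name)]
print(suspects)
```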
In [20]:
df_dog[df_dog.tweet_id.duplicated()]
Out[20]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
In [21]:
df_dog.source.value_counts()
Out[21]:
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64
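
Each source value is an HTML anchor tag wrapping a short label. A sketch of pulling out just the label with a regular expression follows. source_label is a hypothetical helper, and the pattern assumes a single anchor element per value, as in the four values above.

```python
import re

# Capture the text between the opening <a ...> tag and the closing </a>.
ANCHOR_TEXT = re.compile(r'<a [^>]*>(.*?)</a>')


def source_label(source):
    """Return the text inside the anchor tag, or the input unchanged if none is found."""
    match = ANCHOR_TEXT.search(source)
    return match.group(1) if match else source


sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
]
labels = [source_label(s) for s in sources]
print(labels)
```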
In [22]:
df_dog[df_dog.source=='<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>']
Out[22]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
529 808344865868283904 NaN NaN 2016-12-12 16:16:49 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Seamus. He's very bad at entering pool... NaN NaN NaN https://vine.co/v/5QWd3LZqXxd 11 10 Seamus None None None None
562 802600418706604034 NaN NaN 2016-11-26 19:50:26 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Bailey. She has mastered the head tilt... NaN NaN NaN https://vine.co/v/5FwUWjYaW0Y 11 10 Bailey None None None None
657 791774931465953280 NaN NaN 2016-10-27 22:53:48 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Vine will be deeply missed. This was by far my... NaN NaN NaN https://vine.co/v/ea0OwvPTx9l 14 10 None None None None None
672 789903600034189313 NaN NaN 2016-10-22 18:57:48 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Ralphy. His dreams were just shattered... NaN NaN NaN https://vine.co/v/5wPT1aBxPQZ 13 10 Ralphy None None pupper None
699 786286427768250368 NaN NaN 2016-10-12 19:24:27 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Arnie. He's afraid of his own bark. 12... NaN NaN NaN https://vine.co/v/5XH0WqHwiFp 12 10 Arnie None None None None
713 784183165795655680 NaN NaN 2016-10-07 00:06:50 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Reginald. He's one magical puppo. Aero... NaN NaN NaN https://vine.co/v/5ghHLBMMdlV 12 10 Reginald None None None puppo
714 784057939640352768 NaN NaN 2016-10-06 15:49:14 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Balto. He's very content. Legendary to... NaN NaN NaN https://vine.co/v/5gKxeUpuKEr 12 10 Balto None None None None
731 781655249211752448 NaN NaN 2016-09-30 00:41:48 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Combo. The daily struggles of being a ... NaN NaN NaN https://vine.co/v/5rt6T3qm7hL 11 10 Combo doggo None None None
733 781308096455073793 NaN NaN 2016-09-29 01:42:20 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Pupper butt 1, Doggo 0. Both 12/10 https://t.c... NaN NaN NaN https://vine.co/v/5rgu2Law2ut 12 10 None doggo None pupper None
746 780074436359819264 NaN NaN 2016-09-25 16:00:13 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here's a doggo questioning his entire existenc... NaN NaN NaN https://vine.co/v/5nzYBpl0TY2 10 10 None doggo None None None
783 775350846108426240 NaN NaN 2016-09-12 15:10:21 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Maximus. A little rain won't stop him.... NaN NaN NaN https://vine.co/v/ijmv0PD0XXD 12 10 Maximus None None None None
881 760521673607086080 NaN NaN 2016-08-02 17:04:31 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Doggo want what doggo cannot have. Temptation ... NaN NaN NaN https://vine.co/v/5ApKetxzmTB 12 10 None doggo None None None
886 759943073749200896 NaN NaN 2016-08-01 02:45:22 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here's a wicked fast pupper. 12/10 camera coul... NaN NaN NaN https://vine.co/v/5AJm5pq7Kav 12 10 None None None pupper None
905 758099635764359168 NaN NaN 2016-07-27 00:40:12 +0000 <a href="http://vine.co" rel="nofollow">Vine -... In case you haven't seen the most dramatic sne... NaN NaN NaN https://vine.co/v/hQJbaj1VpIz 13 10 None None None None None
939 753039830821511168 NaN NaN 2016-07-13 01:34:21 +0000 <a href="http://vine.co" rel="nofollow">Vine -... So this just changed my life. 13/10 please enj... NaN NaN NaN https://vine.co/v/5W2Dg3XPX7a 13 10 None None None None None
941 752932432744185856 NaN NaN 2016-07-12 18:27:35 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Carl. He's very powerful. 12/10 don't ... NaN NaN NaN https://vine.co/v/OEppMFbejFz 12 10 Carl None None None None
946 752568224206688256 NaN NaN 2016-07-11 18:20:21 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here are three doggos completely misjudging an... NaN NaN NaN https://vine.co/v/5W0bdhEUUVT 9 10 None None None None None
951 751950017322246144 NaN NaN 2016-07-10 01:23:49 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Lola. She's a surfing pupper. 13/10 ma... NaN NaN NaN https://vine.co/v/5WrjaYAMvMO 13 10 Lola None None pupper None
954 751793661361422336 NaN NaN 2016-07-09 15:02:31 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Fred. He's having one heck of a summer... NaN NaN NaN https://vine.co/v/5W5YHdTJvaV 11 10 Fred None None None None
985 749075273010798592 NaN NaN 2016-07-02 03:00:36 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Boomer. He's self-baptizing. Other dog... NaN NaN NaN https://vine.co/v/5ztZvHgI17r 11 10 Boomer doggo None None None
996 748337862848962560 NaN NaN 2016-06-30 02:10:24 +0000 <a href="http://vine.co" rel="nofollow">Vine -... SWIM AWAY PUPPER SWIM AWAY 13/10 #BarkWeek ht... NaN NaN NaN https://vine.co/v/h5aDaFthX6O 13 10 None None None pupper None
999 748220828303695873 NaN NaN 2016-06-29 18:25:21 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Stop what you're doing and watch this heckin m... NaN NaN NaN https://vine.co/v/iiLjKuYJpr6 13 10 None None None None None
1006 747648653817413632 NaN NaN 2016-06-28 04:31:44 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Keurig. He apparently headbutts other ... NaN NaN NaN https://vine.co/v/iqIZFtOxEMB 12 10 Keurig None None None None
1011 747439450712596480 NaN NaN 2016-06-27 14:40:26 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Linus. He just wanted to say hello but... NaN NaN NaN https://vine.co/v/5uTVXWvn3Ip 12 10 Linus None None None None
1020 746757706116112384 NaN NaN 2016-06-25 17:31:25 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Maddie. She gets some wicked air time.... NaN NaN NaN https://vine.co/v/5BYq6hmrEI3 11 10 Maddie None None None None
1022 746542875601690625 NaN NaN 2016-06-25 03:17:46 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here's a golden floofer helping with the groce... NaN NaN NaN https://vine.co/v/5uZYwqmuDeT 11 10 None None floofer None None
1033 745074613265149952 NaN NaN 2016-06-21 02:03:25 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Jeffrey. He wasn't prepared to execute... NaN NaN NaN https://vine.co/v/iQm3JAXuFmv 11 10 Jeffrey None None None None
1051 742534281772302336 NaN NaN 2016-06-14 01:49:03 +0000 <a href="http://vine.co" rel="nofollow">Vine -... For anyone who's wondering, this is what happe... NaN NaN NaN https://vine.co/v/iLTZmtE1FTB 11 10 None doggo None None None
1062 741099773336379392 NaN NaN 2016-06-10 02:48:49 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Ted. He's given up. 11/10 relatable af... NaN NaN NaN https://vine.co/v/ixHYvdxUx1L 11 10 Ted None None None None
1075 739623569819336705 NaN NaN 2016-06-06 01:02:55 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here's a doggo that don't need no human. 12/10... NaN NaN NaN https://vine.co/v/iY9Fr1I31U6 12 10 None doggo None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1482 693267061318012928 NaN NaN 2016-01-30 02:58:42 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Oscar. He can wave. Friendly af. 12/10... NaN NaN NaN https://vine.co/v/i5n2irFUYWv 12 10 Oscar None None None None
1502 692041934689402880 NaN NaN 2016-01-26 17:50:29 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Teddy. His head is too heavy. 13/10 (v... NaN NaN NaN https://vine.co/v/iiI3wmqXYmA 13 10 Teddy None None None None
1505 691793053716221953 NaN NaN 2016-01-26 01:21:31 +0000 <a href="http://vine.co" rel="nofollow">Vine -... We usually don't rate penguins but this one is... NaN NaN NaN https://vine.co/v/OTTVAKw6YlW 10 10 None None None None None
1515 690989312272396288 NaN NaN 2016-01-23 20:07:44 +0000 <a href="http://vine.co" rel="nofollow">Vine -... We've got a doggy down. Requesting backup. 12/... NaN NaN NaN https://vine.co/v/iOZKZEU2nHq 12 10 None None None None None
1528 690348396616552449 NaN NaN 2016-01-22 01:40:58 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Oddie. He's trying to communicate. 12/... NaN NaN NaN https://vine.co/v/iejBWerY9X2 12 10 Oddie None None None None
1534 689993469801164801 NaN NaN 2016-01-21 02:10:37 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here we are witnessing a rare High Stepping Al... NaN NaN NaN https://vine.co/v/ienexVMZgi5 12 10 None None floofer None None
1549 689255633275777024 NaN NaN 2016-01-19 01:18:43 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Ferg. He swallowed a chainsaw. 1 like ... NaN NaN NaN https://vine.co/v/iOL792n5hz2 10 10 Ferg None None None None
1566 687841446767013888 NaN NaN 2016-01-15 03:39:15 +0000 <a href="http://vine.co" rel="nofollow">Vine -... 13/10 I can't stop watching this (vid by @k8ly... NaN NaN NaN https://vine.co/v/iOWwUPH1hrw 13 10 None None None None None
1570 687732144991551489 NaN NaN 2016-01-14 20:24:55 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Ember. That's the q-tip she owes money... NaN NaN NaN https://vine.co/v/iOuMphL5DBY 11 10 Ember None None None None
1577 687399393394311168 NaN NaN 2016-01-13 22:22:41 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Barry. He's very fast. I hope he finds... NaN NaN NaN https://vine.co/v/iM2hLu9LU5i 10 10 Barry None None None None
1586 686760001961103360 NaN NaN 2016-01-12 04:01:58 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This pupper forgot how to walk. 12/10 happens ... NaN NaN NaN https://vine.co/v/iMvubwT260D 12 10 None None None pupper None
1592 686394059078897668 NaN NaN 2016-01-11 03:47:50 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This pup's having a nightmare that he forgot t... NaN NaN NaN https://vine.co/v/iMqBebnOvav 12 10 None None None None None
1596 686286779679375361 NaN NaN 2016-01-10 20:41:33 +0000 <a href="http://vine.co" rel="nofollow">Vine -... When bae calls your name from across the room.... NaN NaN NaN https://vine.co/v/iMZx6aDbExn 12 10 None None None None None
1625 684830982659280897 NaN NaN 2016-01-06 20:16:44 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This little fella really hates stairs. Prefers... NaN NaN NaN https://vine.co/v/eEZXZI1rqxX 13 10 None None None pupper None
1628 684588130326986752 NaN NaN 2016-01-06 04:11:43 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This pupper just got his first kiss. 12/10 he'... NaN NaN NaN https://vine.co/v/ihWIxntjtO7 12 10 None None None pupper None
1640 684147889187209216 NaN NaN 2016-01-04 23:02:22 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Sweets the English Bulldog. Waves back... NaN NaN NaN https://vine.co/v/ib2nTOEuuOI 12 10 Sweets None None pupper None
1650 683515932363329536 NaN NaN 2016-01-03 05:11:12 +0000 <a href="http://vine.co" rel="nofollow">Vine -... HEY PUP WHAT'S THE PART OF THE HUMAN BODY THAT... NaN NaN NaN https://vine.co/v/ibvnzrauFuV 11 10 None None None None None
1676 682088079302213632 NaN NaN 2015-12-30 06:37:25 +0000 <a href="http://vine.co" rel="nofollow">Vine -... I'm not sure what this dog is doing but it's p... NaN NaN NaN https://vine.co/v/iqMjlxULzbn 12 10 None None None None None
1706 680805554198020098 NaN NaN 2015-12-26 17:41:07 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This guy's dog broke. So sad. 9/10 would still... NaN NaN NaN https://vine.co/v/iAP0Ugzi2PO 9 10 None None None None None
1728 679872969355714560 NaN NaN 2015-12-24 03:55:21 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Rocco. He's in a very intense game of ... NaN NaN NaN https://vine.co/v/iAAxTbj1UAM 10 10 Rocco None None None None
1743 679405845277462528 NaN NaN 2015-12-22 20:59:10 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Crazy unseen footage from Jurassic Park. 10/10... NaN NaN NaN https://vine.co/v/iKVFEigMLxP 10 10 None None None None None
1750 679001094530465792 NaN NaN 2015-12-21 18:10:50 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Rascal. He's paddling an imaginary can... NaN NaN NaN https://vine.co/v/iKIwAzEatd6 11 10 Rascal None None None None
1760 678708137298427904 NaN NaN 2015-12-20 22:46:44 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here we are witnessing a wild field pupper. Lo... NaN NaN NaN https://vine.co/v/eQjxxYaQ60K 10 10 None None None pupper None
1776 677961670166224897 NaN NaN 2015-12-18 21:20:32 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Izzy. She's showing off the dance move... NaN NaN NaN https://vine.co/v/iKuMDuYV0aZ 11 10 Izzy None None None None
1791 677335745548390400 NaN NaN 2015-12-17 03:53:20 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Downright inspiring 12/10 https://t.co/vSLtYBWHcQ NaN NaN NaN https://vine.co/v/hbLbH77Ar67 12 10 None None None None None
1807 676916996760600576 NaN NaN 2015-12-16 00:09:23 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Super speedy pupper. Does not go gentle into t... NaN NaN NaN https://vine.co/v/imJ0BdZOJTw 10 10 None None None pupper None
1818 676593408224403456 NaN NaN 2015-12-15 02:43:33 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This pupper loves leaves. 11/10 for committed ... NaN NaN NaN https://vine.co/v/eEQQaPFbgOY 11 10 None None None pupper None
1834 676121918416756736 NaN NaN 2015-12-13 19:30:01 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here we are witnessing a very excited dog. Cle... NaN NaN NaN https://vine.co/v/iZXg7VpeDAv 8 10 None None None None None
1916 674307341513269249 NaN NaN 2015-12-08 19:19:32 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is life-changing. 12/10 https://t.co/SroT... NaN NaN NaN https://vine.co/v/i7nWzrenw5h 12 10 life None None None None
2212 668587383441514497 NaN NaN 2015-11-23 00:30:28 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Never forget this vine. You will not stop watc... NaN NaN NaN https://vine.co/v/ea0OwvPTx9l 13 10 the None None None None

91 rows × 17 columns

In [23]:
df_dog.name.value_counts()
Out[23]:
None            745
a                55
Charlie          12
Cooper           11
Lucy             11
Oliver           11
Penny            10
Tucker           10
Lola             10
Winston           9
Bo                9
Sadie             8
the               8
Toby              7
Bailey            7
an                7
Daisy             7
Buddy             7
Rusty             6
Stanley           6
Bella             6
Milo              6
Scout             6
Oscar             6
Dave              6
Leo               6
Koda              6
Jack              6
Jax               6
very              5
               ... 
Rizzo             1
Stuart            1
Maxwell           1
Pumpkin           1
Heinrich          1
Winifred          1
Pippin            1
Gin               1
Shnuggles         1
Brandi            1
Clarq             1
Crumpet           1
Iggy              1
Dotsy             1
Stormy            1
Bluebert          1
Dietrich          1
Tessa             1
Brockly           1
Antony            1
Tupawc            1
Cleopatricia      1
Sailor            1
Taz               1
Genevieve         1
Bowie             1
Jersey            1
Kevon             1
Huxley            1
Fillup            1
Name: name, Length: 957, dtype: int64
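
The counts above include entries such as a, the, an and very, which are ordinary lowercase words rather than dog names. A minimal sketch of how such entries could be flagged, assuming (not verified against the full archive) that genuine names are always capitalized. The small `names` Series is a hypothetical stand-in for `df_dog.name`, built from values visible in the output above.

```python
import pandas as pd

# Hypothetical miniature of df_dog.name, using values seen in the output above.
names = pd.Series(['None', 'a', 'Charlie', 'the', 'an', 'very', 'Lucy'], name='name')

# Assumption: real dog names start with a capital letter, so a value made up
# entirely of lowercase letters is probably a mis-extracted word.
suspect_mask = names.str.islower()
suspect_names = names[suspect_mask]

# Tally the suspect values, as value_counts() did for the full column.
print(suspect_names.value_counts())
```

On the real data the same mask would be `df_dog.name.str.islower()`. The placeholder string 'None' contains a capital letter, so it is not caught by this check and would need separate handling.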
In [24]:
df_dog.doggo.value_counts()
Out[24]:
None     2259
doggo      97
Name: doggo, dtype: int64
In [25]:
df_dog[df_dog.doggo!='None']
Out[25]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
9 890240255349198849 NaN NaN 2017-07-26 15:59:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Cassie. She is a college pup. Studying... NaN NaN NaN https://twitter.com/dog_rates/status/890240255... 14 10 Cassie doggo None None None
43 884162670584377345 NaN NaN 2017-07-09 21:29:42 +0000 <a href="http://twitter.com/download/iphone" r... Meet Yogi. He doesn't have any important dog m... NaN NaN NaN https://twitter.com/dog_rates/status/884162670... 12 10 Yogi doggo None None None
99 872967104147763200 NaN NaN 2017-06-09 00:02:31 +0000 <a href="http://twitter.com/download/iphone" r... Here's a very large dog. He has a date later. ... NaN NaN NaN https://twitter.com/dog_rates/status/872967104... 12 10 None doggo None None None
108 871515927908634625 NaN NaN 2017-06-04 23:56:03 +0000 <a href="http://twitter.com/download/iphone" r... This is Napolean. He's a Raggedy East Nicaragu... NaN NaN NaN https://twitter.com/dog_rates/status/871515927... 12 10 Napolean doggo None None None
110 871102520638267392 NaN NaN 2017-06-03 20:33:19 +0000 <a href="http://twitter.com/download/iphone" r... Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH NaN NaN NaN https://twitter.com/animalcog/status/871075758... 14 10 None doggo None None None
121 869596645499047938 NaN NaN 2017-05-30 16:49:31 +0000 <a href="http://twitter.com/download/iphone" r... This is Scout. He just graduated. Officially a... NaN NaN NaN https://twitter.com/dog_rates/status/869596645... 12 10 Scout doggo None None None
172 858843525470990336 NaN NaN 2017-05-01 00:40:27 +0000 <a href="http://twitter.com/download/iphone" r... I have stumbled puppon a doggo painting party.... NaN NaN NaN https://twitter.com/dog_rates/status/858843525... 13 10 None doggo None None None
191 855851453814013952 NaN NaN 2017-04-22 18:31:02 +0000 <a href="http://twitter.com/download/iphone" r... Here's a puppo participating in the #ScienceMa... NaN NaN NaN https://twitter.com/dog_rates/status/855851453... 13 10 None doggo None None puppo
200 854010172552949760 NaN NaN 2017-04-17 16:34:26 +0000 <a href="http://twitter.com/download/iphone" r... At first I thought this was a shy doggo, but i... NaN NaN NaN https://twitter.com/dog_rates/status/854010172... 11 10 None doggo floofer None None
211 851953902622658560 NaN NaN 2017-04-12 00:23:33 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Astrid. She's a guide d... 8.293743e+17 4.196984e+09 2017-02-08 17:00:26 +0000 https://twitter.com/dog_rates/status/829374341... 13 10 Astrid doggo None None None
240 846514051647705089 NaN NaN 2017-03-28 00:07:32 +0000 <a href="http://twitter.com/download/iphone" r... This is Barney. He's an elder doggo. Hitches a... NaN NaN NaN https://twitter.com/dog_rates/status/846514051... 13 10 Barney doggo None None None
248 845397057150107648 NaN NaN 2017-03-24 22:08:59 +0000 <a href="http://twitter.com/download/iphone" r... Say hello to Mimosa. She's an emotional suppor... NaN NaN NaN https://www.gofundme.com/help-save-a-pup,https... 13 10 Mimosa doggo None None None
300 836753516572119041 NaN NaN 2017-03-01 01:42:39 +0000 <a href="http://twitter.com/download/iphone" r... This is Meera. She just heard about taxes and ... NaN NaN NaN https://twitter.com/dog_rates/status/836753516... 12 10 Meera doggo None None None
318 834574053763584002 NaN NaN 2017-02-23 01:22:14 +0000 <a href="http://twitter.com/download/iphone" r... Here's a doggo fully pupared for a shower. H*c... NaN NaN NaN https://twitter.com/dog_rates/status/834574053... 13 10 None doggo None None None
323 834089966724603904 NaN NaN 2017-02-21 17:18:39 +0000 <a href="http://twitter.com/download/iphone" r... DOGGO ON THE LOOSE I REPEAT DOGGO ON THE LOOSE... NaN NaN NaN https://twitter.com/stevekopack/status/8340866... 10 10 None doggo None None None
331 832998151111966721 NaN NaN 2017-02-18 17:00:10 +0000 <a href="http://twitter.com/download/iphone" r... This is Rhino. He arrived at a shelter with an... NaN NaN NaN https://twitter.com/dog_rates/status/832998151... 13 10 Rhino doggo None None None
339 832273440279240704 NaN NaN 2017-02-16 17:00:25 +0000 <a href="http://twitter.com/download/iphone" r... Say hello to Smiley. He's a blind therapy dogg... NaN NaN NaN https://twitter.com/dog_rates/status/832273440... 14 10 Smiley doggo None None None
344 832032802820481025 NaN NaN 2017-02-16 01:04:13 +0000 <a href="http://twitter.com/download/iphone" r... This is Miguel. He was the only remaining dogg... NaN NaN NaN https://www.petfinder.com/petdetail/34918210,h... 12 10 Miguel doggo None None None
345 831939777352105988 NaN NaN 2017-02-15 18:54:34 +0000 <a href="http://twitter.com/download/iphone" r... This is Emanuel. He's a h*ckin rare doggo. Dwe... NaN NaN NaN https://twitter.com/dog_rates/status/831939777... 12 10 Emanuel doggo None None None
351 831322785565769729 NaN NaN 2017-02-14 02:02:51 +0000 <a href="http://twitter.com/download/iphone" r... This is Pete. He has no eyes. Needs a guide do... NaN NaN NaN https://twitter.com/dog_rates/status/831322785... 12 10 Pete doggo None None None
359 829878982036299777 NaN NaN 2017-02-10 02:25:42 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Loki. He smiles like El... 8.269587e+17 4.196984e+09 2017-02-02 01:01:21 +0000 https://twitter.com/dog_rates/status/826958653... 12 10 Loki doggo None None None
362 829449946868879360 NaN NaN 2017-02-08 22:00:52 +0000 <a href="http://twitter.com/download/iphone" r... Here's a stressed doggo. Had a long day. Many ... NaN NaN NaN https://twitter.com/dog_rates/status/829449946... 11 10 None doggo None None None
363 829374341691346946 NaN NaN 2017-02-08 17:00:26 +0000 <a href="http://twitter.com/download/iphone" r... This is Astrid. She's a guide doggo in trainin... NaN NaN NaN https://twitter.com/dog_rates/status/829374341... 13 10 Astrid doggo None None None
372 828381636999917570 NaN NaN 2017-02-05 23:15:47 +0000 <a href="http://twitter.com/download/iphone" r... Meet Doobert. He's a deaf doggo. Didn't stop h... NaN NaN NaN https://twitter.com/dog_rates/status/828381636... 14 10 Doobert doggo None None None
384 826958653328592898 NaN NaN 2017-02-02 01:01:21 +0000 <a href="http://twitter.com/download/iphone" r... This is Loki. He smiles like Elvis. Ain't noth... NaN NaN NaN https://twitter.com/dog_rates/status/826958653... 12 10 Loki doggo None None None
385 826848821049180160 NaN NaN 2017-02-01 17:44:55 +0000 <a href="http://twitter.com/download/iphone" r... This is Cupid. He was found in the trash. Now ... NaN NaN NaN https://twitter.com/dog_rates/status/826848821... 13 10 Cupid doggo None None None
389 826476773533745153 NaN NaN 2017-01-31 17:06:32 +0000 <a href="http://twitter.com/download/iphone" r... This is Pilot. He has mastered the synchronize... NaN NaN NaN https://twitter.com/dog_rates/status/826476773... 12 10 Pilot doggo None None None
391 826204788643753985 NaN NaN 2017-01-30 23:05:46 +0000 <a href="http://twitter.com/download/iphone" r... Here's a little more info on Dew, your favorit... NaN NaN NaN http://us.blastingnews.com/news/2017/01/kentuc... 13 10 None doggo None None None
423 821765923262631936 NaN NaN 2017-01-18 17:07:18 +0000 <a href="http://twitter.com/download/iphone" r... This is Duchess. She uses dark doggo forces to... NaN NaN NaN https://twitter.com/dog_rates/status/821765923... 13 10 Duchess doggo None None None
425 821421320206483457 NaN NaN 2017-01-17 18:17:58 +0000 <a href="http://twitter.com/download/iphone" r... RT @dog_rates: This is Sampson. He just gradua... 7.823059e+17 4.196984e+09 2016-10-01 19:47:08 +0000 https://twitter.com/dog_rates/status/782305867... 12 10 Sampson doggo None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
857 763956972077010945 7.638652e+17 1.584641e+07 2016-08-12 04:35:10 +0000 <a href="http://twitter.com/download/iphone" r... @TheEllenShow I'm not sure if you know this bu... NaN NaN NaN NaN 12 10 None doggo None None None
877 760893934457552897 NaN NaN 2016-08-03 17:43:45 +0000 <a href="http://twitter.com/download/iphone" r... This is Wishes. He has the day off. Daily stru... NaN NaN NaN https://twitter.com/dog_rates/status/760893934... 11 10 Wishes doggo None None None
881 760521673607086080 NaN NaN 2016-08-02 17:04:31 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Doggo want what doggo cannot have. Temptation ... NaN NaN NaN https://vine.co/v/5ApKetxzmTB 12 10 None doggo None None None
889 759793422261743616 NaN NaN 2016-07-31 16:50:42 +0000 <a href="http://twitter.com/download/iphone" r... Meet Maggie &amp; Lila. Maggie is the doggo, L... NaN NaN NaN https://twitter.com/dog_rates/status/759793422... 12 10 Maggie doggo None pupper None
899 758828659922702336 NaN NaN 2016-07-29 00:57:05 +0000 <a href="http://twitter.com/download/iphone" r... This doggo is just waiting for someone to be p... NaN NaN NaN https://twitter.com/dog_rates/status/758828659... 13 10 None doggo None None None
914 757393109802180609 NaN NaN 2016-07-25 01:52:43 +0000 <a href="http://twitter.com/download/iphone" r... Here's a doggo completely oblivious to the dou... NaN NaN NaN https://twitter.com/dog_rates/status/757393109... 10 10 None doggo None None None
919 756526248105566208 NaN NaN 2016-07-22 16:28:07 +0000 <a href="http://twitter.com/download/iphone" r... All hail sky doggo. 13/10 would jump super hig... NaN NaN NaN https://twitter.com/dog_rates/status/756526248... 13 10 None doggo None None None
924 755206590534418437 NaN NaN 2016-07-19 01:04:16 +0000 <a href="http://twitter.com/download/iphone" r... This is one of the most inspirational stories ... NaN NaN NaN https://twitter.com/dog_rates/status/755206590... 14 10 one doggo None None None
944 752682090207055872 NaN NaN 2016-07-12 01:52:49 +0000 <a href="http://twitter.com/download/iphone" r... Nothing better than a doggo and a sunset. 10/1... NaN NaN NaN https://twitter.com/dog_rates/status/752682090... 10 10 None doggo None None None
945 752660715232722944 NaN NaN 2016-07-12 00:27:52 +0000 <a href="http://twitter.com/download/iphone" r... Hooman used Pokeball\n*wiggle*\n*wiggle*\nDogg... NaN NaN NaN https://twitter.com/dog_rates/status/752660715... 10 10 None doggo None None None
948 752334515931054080 NaN NaN 2016-07-11 02:51:40 +0000 <a href="http://twitter.com/download/iphone" r... Here's a doggo trying to catch some fish. 8/10... NaN NaN NaN https://twitter.com/dog_rates/status/752334515... 8 10 None doggo None None None
956 751583847268179968 NaN NaN 2016-07-09 01:08:47 +0000 <a href="http://twitter.com/download/iphone" r... Please stop sending it pictures that don't eve... NaN NaN NaN https://twitter.com/dog_rates/status/751583847... 5 10 None doggo None pupper None
967 750381685133418496 7.501805e+17 4.717297e+09 2016-07-05 17:31:49 +0000 <a href="http://twitter.com/download/iphone" r... 13/10 such a good doggo\n@spaghemily NaN NaN NaN NaN 13 10 None doggo None None None
977 750011400160841729 NaN NaN 2016-07-04 17:00:26 +0000 <a href="https://about.twitter.com/products/tw... Meet Piper. She's an airport doggo. Please ret... NaN NaN NaN https://twitter.com/dog_rates/status/750011400... 11 10 Piper doggo None None None
985 749075273010798592 NaN NaN 2016-07-02 03:00:36 +0000 <a href="http://vine.co" rel="nofollow">Vine -... This is Boomer. He's self-baptizing. Other dog... NaN NaN NaN https://vine.co/v/5ztZvHgI17r 11 10 Boomer doggo None None None
989 748932637671223296 NaN NaN 2016-07-01 17:33:49 +0000 <a href="http://twitter.com/download/iphone" r... Say hello to Divine Doggo. Must be magical af.... NaN NaN NaN https://twitter.com/dog_rates/status/748932637... 13 10 Divine doggo None None None
992 748692773788876800 NaN NaN 2016-07-01 01:40:41 +0000 <a href="http://twitter.com/download/iphone" r... That is Quizno. This is his beach. He does not... NaN NaN NaN https://twitter.com/dog_rates/status/748692773... 10 10 his doggo None None None
1030 745433870967832576 NaN NaN 2016-06-22 01:50:58 +0000 <a href="http://twitter.com/download/iphone" r... This is Lenox. She's in a wheelbarrow. Silly d... NaN NaN NaN https://twitter.com/dog_rates/status/745433870... 10 10 Lenox doggo None None None
1039 744234799360020481 NaN NaN 2016-06-18 18:26:18 +0000 <a href="http://twitter.com/download/iphone" r... Here's a doggo realizing you can stand in a po... NaN NaN NaN https://twitter.com/dog_rates/status/744234799... 13 10 None doggo None None None
1051 742534281772302336 NaN NaN 2016-06-14 01:49:03 +0000 <a href="http://vine.co" rel="nofollow">Vine -... For anyone who's wondering, this is what happe... NaN NaN NaN https://vine.co/v/iLTZmtE1FTB 11 10 None doggo None None None
1063 741067306818797568 NaN NaN 2016-06-10 00:39:48 +0000 <a href="http://twitter.com/download/iphone" r... This is just downright precious af. 12/10 for ... NaN NaN NaN https://twitter.com/dog_rates/status/741067306... 12 10 just doggo None pupper None
1075 739623569819336705 NaN NaN 2016-06-06 01:02:55 +0000 <a href="http://vine.co" rel="nofollow">Vine -... Here's a doggo that don't need no human. 12/10... NaN NaN NaN https://vine.co/v/iY9Fr1I31U6 12 10 None doggo None None None
1079 739238157791694849 NaN NaN 2016-06-04 23:31:25 +0000 <a href="http://twitter.com/download/iphone" r... Here's a doggo blowing bubbles. It's downright... NaN NaN NaN https://twitter.com/dog_rates/status/739238157... 13 10 None doggo None None None
1103 735256018284875776 NaN NaN 2016-05-24 23:47:49 +0000 <a href="http://twitter.com/download/iphone" r... This is Kellogg. He accidentally opened the fr... NaN NaN NaN https://twitter.com/dog_rates/status/735256018... 8 10 Kellogg doggo None None None
1113 733109485275860992 NaN NaN 2016-05-19 01:38:16 +0000 <a href="http://twitter.com/download/iphone" r... Like father (doggo), like son (pupper). Both 1... NaN NaN NaN https://twitter.com/dog_rates/status/733109485... 12 10 None doggo None pupper None
1117 732375214819057664 NaN NaN 2016-05-17 01:00:32 +0000 <a href="http://twitter.com/download/iphone" r... This is Kyle (pronounced 'Mitch'). He strives ... NaN NaN NaN https://twitter.com/dog_rates/status/732375214... 11 10 Kyle doggo None None None
1141 727644517743104000 NaN NaN 2016-05-03 23:42:26 +0000 <a href="http://twitter.com/download/iphone" r... Here's a doggo struggling to cope with the win... NaN NaN NaN https://twitter.com/dog_rates/status/727644517... 13 10 None doggo None None None
1156 724771698126512129 NaN NaN 2016-04-26 01:26:53 +0000 <a href="http://twitter.com/download/iphone" r... Nothin better than a doggo and a sunset. 11/10... NaN NaN NaN https://twitter.com/dog_rates/status/724771698... 11 10 None doggo None None None
1176 719991154352222208 NaN NaN 2016-04-12 20:50:42 +0000 <a href="http://twitter.com/download/iphone" r... This doggo was initially thrilled when she saw... NaN NaN NaN https://twitter.com/dog_rates/status/719991154... 10 10 None doggo None None None
1204 716080869887381504 NaN NaN 2016-04-02 01:52:38 +0000 <a href="http://twitter.com/download/iphone" r... Here's a super majestic doggo and a sunset 11/... NaN NaN NaN https://twitter.com/dog_rates/status/716080869... 11 10 None doggo None None None

97 rows × 17 columns
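
Several of the doggo rows above also carry a second stage, for example index 191 (doggo and puppo), 200 (doggo and floofer) and 889 (doggo and pupper). A rough sketch of how tweets with more than one stage could be counted. The `stages` frame is a hypothetical miniature that mirrors only the stage columns of those rows plus index 9.

```python
import pandas as pd

# Hypothetical miniature of the four stage columns of df_dog.
# The rows mirror index 9, 191, 200 and 889 from the output above.
stages = pd.DataFrame(
    {
        'doggo': ['doggo', 'doggo', 'doggo', 'doggo'],
        'floofer': ['None', 'None', 'floofer', 'None'],
        'pupper': ['None', 'None', 'None', 'pupper'],
        'puppo': ['None', 'puppo', 'None', 'None'],
    },
    index=[9, 191, 200, 889],
)

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# A stage column is "filled" when it holds anything other than the string 'None'.
n_stages = (stages[stage_cols] != 'None').sum(axis=1)

# Keep only the tweets that were tagged with two or more stages.
multi_stage = stages[n_stages > 1]
print(multi_stage)
```

Applied to `df_dog[stage_cols]`, the same expression would list every tweet that needs a decision about which stage to keep.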

In [26]:
df_tweet.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2347 entries, 0 to 2346
Data columns (total 4 columns):
tweet_id           2347 non-null object
favorite_count     2347 non-null int64
retweet_count      2347 non-null int64
followers_count    2347 non-null int64
dtypes: int64(3), object(1)
memory usage: 73.4+ KB
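
Note that tweet_id is stored as object here, while df_breed.info() reports it as int64. Joining the two tables on tweet_id would need a common type first. A small sketch of one option, converting IDs to strings (a design assumption: the IDs are labels, not quantities). The two Series are hypothetical stand-ins built from IDs that appear in df_breed.

```python
import pandas as pd

# Stand-in for df_tweet.tweet_id (object dtype: IDs stored as strings).
tweet_ids_obj = pd.Series(['666020888022790149', '666029285002620928'], name='tweet_id')

# Stand-in for df_breed.tweet_id (int64 dtype).
breed_ids_int = pd.Series([666020888022790149, 666029285002620928], name='tweet_id')

# Convert both sides to strings so that the key columns are comparable.
tweet_ids_str = tweet_ids_obj.astype(str)
breed_ids_str = breed_ids_int.astype(str)

# After conversion the IDs match element by element.
same_ids = (tweet_ids_str == breed_ids_str).all()
print(same_ids)
```

Converting in the other direction (strings to int64) would also work for these values; strings simply avoid any accidental arithmetic or float conversion of the IDs.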
In [27]:
df_tweet.describe()
Out[27]:
favorite_count retweet_count followers_count
count 2347.000000 2347.000000 2.347000e+03
mean 8118.085215 3065.134214 5.628142e+06
std 12204.073905 5090.205399 1.999712e+02
min 0.000000 0.000000 5.627901e+06
25% 1408.500000 611.500000 5.627924e+06
50% 3568.000000 1433.000000 5.628174e+06
75% 10054.000000 3576.000000 5.628198e+06
max 144100.000000 78219.000000 5.628722e+06
In [28]:
df_breed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [29]:
df_breed.describe()
Out[29]:
tweet_id img_num p1_conf p2_conf p3_conf
count 2.075000e+03 2075.000000 2075.000000 2.075000e+03 2.075000e+03
mean 7.384514e+17 1.203855 0.594548 1.345886e-01 6.032417e-02
std 6.785203e+16 0.561875 0.271174 1.006657e-01 5.090593e-02
min 6.660209e+17 1.000000 0.044333 1.011300e-08 1.740170e-10
25% 6.764835e+17 1.000000 0.364412 5.388625e-02 1.622240e-02
50% 7.119988e+17 1.000000 0.588230 1.181810e-01 4.944380e-02
75% 7.932034e+17 1.000000 0.843855 1.955655e-01 9.180755e-02
max 8.924206e+17 4.000000 1.000000 4.880140e-01 2.734190e-01
In [30]:
df_breed
Out[30]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True
5 666050758794694657 https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg 1 Bernese_mountain_dog 0.651137 True English_springer 0.263788 True Greater_Swiss_Mountain_dog 0.016199 True
6 666051853826850816 https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg 1 box_turtle 0.933012 False mud_turtle 0.045885 False terrapin 0.017885 False
7 666055525042405380 https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg 1 chow 0.692517 True Tibetan_mastiff 0.058279 True fur_coat 0.054449 False
8 666057090499244032 https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg 1 shopping_cart 0.962465 False shopping_basket 0.014594 False golden_retriever 0.007959 True
9 666058600524156928 https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg 1 miniature_poodle 0.201493 True komondor 0.192305 True soft-coated_wheaten_terrier 0.082086 True
10 666063827256086533 https://pbs.twimg.com/media/CT5Vg_wXIAAXfnj.jpg 1 golden_retriever 0.775930 True Tibetan_mastiff 0.093718 True Labrador_retriever 0.072427 True
11 666071193221509120 https://pbs.twimg.com/media/CT5cN_3WEAAlOoZ.jpg 1 Gordon_setter 0.503672 True Yorkshire_terrier 0.174201 True Pekinese 0.109454 True
12 666073100786774016 https://pbs.twimg.com/media/CT5d9DZXAAALcwe.jpg 1 Walker_hound 0.260857 True English_foxhound 0.175382 True Ibizan_hound 0.097471 True
13 666082916733198337 https://pbs.twimg.com/media/CT5m4VGWEAAtKc8.jpg 1 pug 0.489814 True bull_mastiff 0.404722 True French_bulldog 0.048960 True
14 666094000022159362 https://pbs.twimg.com/media/CT5w9gUW4AAsBNN.jpg 1 bloodhound 0.195217 True German_shepherd 0.078260 True malinois 0.075628 True
15 666099513787052032 https://pbs.twimg.com/media/CT51-JJUEAA6hV8.jpg 1 Lhasa 0.582330 True Shih-Tzu 0.166192 True Dandie_Dinmont 0.089688 True
16 666102155909144576 https://pbs.twimg.com/media/CT54YGiWUAEZnoK.jpg 1 English_setter 0.298617 True Newfoundland 0.149842 True borzoi 0.133649 True
17 666104133288665088 https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg 1 hen 0.965932 False cock 0.033919 False partridge 0.000052 False
18 666268910803644416 https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg 1 desktop_computer 0.086502 False desk 0.085547 False bookcase 0.079480 False
19 666273097616637952 https://pbs.twimg.com/media/CT8T1mtUwAA3aqm.jpg 1 Italian_greyhound 0.176053 True toy_terrier 0.111884 True basenji 0.111152 True
20 666287406224695296 https://pbs.twimg.com/media/CT8g3BpUEAAuFjg.jpg 1 Maltese_dog 0.857531 True toy_poodle 0.063064 True miniature_poodle 0.025581 True
21 666293911632134144 https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg 1 three-toed_sloth 0.914671 False otter 0.015250 False great_grey_owl 0.013207 False
22 666337882303524864 https://pbs.twimg.com/media/CT9OwFIWEAMuRje.jpg 1 ox 0.416669 False Newfoundland 0.278407 True groenendael 0.102643 True
23 666345417576210432 https://pbs.twimg.com/media/CT9Vn7PWoAA_ZCM.jpg 1 golden_retriever 0.858744 True Chesapeake_Bay_retriever 0.054787 True Labrador_retriever 0.014241 True
24 666353288456101888 https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg 1 malamute 0.336874 True Siberian_husky 0.147655 True Eskimo_dog 0.093412 True
25 666362758909284353 https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg 1 guinea_pig 0.996496 False skunk 0.002402 False hamster 0.000461 False
26 666373753744588802 https://pbs.twimg.com/media/CT9vZEYWUAAlZ05.jpg 1 soft-coated_wheaten_terrier 0.326467 True Afghan_hound 0.259551 True briard 0.206803 True
27 666396247373291520 https://pbs.twimg.com/media/CT-D2ZHWIAA3gK1.jpg 1 Chihuahua 0.978108 True toy_terrier 0.009397 True papillon 0.004577 True
28 666407126856765440 https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg 1 black-and-tan_coonhound 0.529139 True bloodhound 0.244220 True flat-coated_retriever 0.173810 True
29 666411507551481857 https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg 1 coho 0.404640 False barracouta 0.271485 False gar 0.189945 False
... ... ... ... ... ... ... ... ... ... ... ... ...
2045 886366144734445568 https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg 1 French_bulldog 0.999201 True Chihuahua 0.000361 True Boston_bull 0.000076 True
2046 886680336477933568 https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg 1 convertible 0.738995 False sports_car 0.139952 False car_wheel 0.044173 False
2047 886736880519319552 https://pbs.twimg.com/media/DE5Se8FXcAAJFx4.jpg 1 kuvasz 0.309706 True Great_Pyrenees 0.186136 True Dandie_Dinmont 0.086346 True
2048 886983233522544640 https://pbs.twimg.com/media/DE8yicJW0AAAvBJ.jpg 2 Chihuahua 0.793469 True toy_terrier 0.143528 True can_opener 0.032253 False
2049 887101392804085760 https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg 1 Samoyed 0.733942 True Eskimo_dog 0.035029 True Staffordshire_bullterrier 0.029705 True
2050 887343217045368832 https://pbs.twimg.com/ext_tw_video_thumb/88734... 1 Mexican_hairless 0.330741 True sea_lion 0.275645 False Weimaraner 0.134203 True
2051 887473957103951883 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 0.054950 True beagle 0.038915 True
2052 887517139158093824 https://pbs.twimg.com/ext_tw_video_thumb/88751... 1 limousine 0.130432 False tow_truck 0.029175 False shopping_cart 0.026321 False
2053 887705289381826560 https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg 1 basset 0.821664 True redbone 0.087582 True Weimaraner 0.026236 True
2054 888078434458587136 https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg 1 French_bulldog 0.995026 True pug 0.000932 True bull_mastiff 0.000903 True
2055 888202515573088257 https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg 2 Pembroke 0.809197 True Rhodesian_ridgeback 0.054950 True beagle 0.038915 True
2056 888554962724278272 https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg 3 Siberian_husky 0.700377 True Eskimo_dog 0.166511 True malamute 0.111411 True
2057 888804989199671297 https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg 1 golden_retriever 0.469760 True Labrador_retriever 0.184172 True English_setter 0.073482 True
2058 888917238123831296 https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg 1 golden_retriever 0.714719 True Tibetan_mastiff 0.120184 True Labrador_retriever 0.105506 True
2059 889278841981685760 https://pbs.twimg.com/ext_tw_video_thumb/88927... 1 whippet 0.626152 True borzoi 0.194742 True Saluki 0.027351 True
2060 889531135344209921 https://pbs.twimg.com/media/DFg_2PVW0AEHN3p.jpg 1 golden_retriever 0.953442 True Labrador_retriever 0.013834 True redbone 0.007958 True
2061 889638837579907072 https://pbs.twimg.com/media/DFihzFfXsAYGDPR.jpg 1 French_bulldog 0.991650 True boxer 0.002129 True Staffordshire_bullterrier 0.001498 True
2062 889665388333682689 https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg 1 Pembroke 0.966327 True Cardigan 0.027356 True basenji 0.004633 True
2063 889880896479866881 https://pbs.twimg.com/media/DFl99B1WsAITKsg.jpg 1 French_bulldog 0.377417 True Labrador_retriever 0.151317 True muzzle 0.082981 False
2064 890006608113172480 https://pbs.twimg.com/media/DFnwSY4WAAAMliS.jpg 1 Samoyed 0.957979 True Pomeranian 0.013884 True chow 0.008167 True
2065 890240255349198849 https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg 1 Pembroke 0.511319 True Cardigan 0.451038 True Chihuahua 0.029248 True
2066 890609185150312448 https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg 1 Irish_terrier 0.487574 True Irish_setter 0.193054 True Chesapeake_Bay_retriever 0.118184 True
2067 890729181411237888 https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg 2 Pomeranian 0.566142 True Eskimo_dog 0.178406 True Pembroke 0.076507 True
2068 890971913173991426 https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg 1 Appenzeller 0.341703 True Border_collie 0.199287 True ice_lolly 0.193548 False
2069 891087950875897856 https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg 1 Chesapeake_Bay_retriever 0.425595 True Irish_terrier 0.116317 True Indian_elephant 0.076902 False
2070 891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg 2 basset 0.555712 True English_springer 0.225770 True German_short-haired_pointer 0.175219 True
2071 891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg 1 paper_towel 0.170278 False Labrador_retriever 0.168086 True spatula 0.040836 False
2072 891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg 1 Chihuahua 0.716012 True malamute 0.078253 True kelpie 0.031379 True
2073 892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg 1 Chihuahua 0.323581 True Pekinese 0.090647 True papillon 0.068957 True
2074 892420643555336193 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg 1 orange 0.097049 False bagel 0.085851 False banana 0.076110 False

2075 rows × 12 columns

Some images are classified, with high confidence, as something other than a dog. Maybe these aren't dogs at all?

In [31]:
url=df_breed.jpg_url[df_breed.tweet_id==666051853826850816].iloc[0]
In [32]:
from PIL import Image
from io import BytesIO
r=requests.get(url)
i = Image.open(BytesIO(r.content))
i
Out[32]:
In [33]:
list(df_dog.text[df_dog.tweet_id==666051853826850816])
Out[33]:
["This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 2/10 https://t.co/v5A4vzSDdc"]

This one is clearly not a dog, so the neural network did not misclassify it. Images like this can also explain some of the very low ratings.
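To gauge how common such images are, one could count the rows where none of the three predictions is flagged as a dog. A minimal sketch, using four rows copied from the df_breed output above in place of the full table (the names `sample` and `no_dog` are assumptions for illustration):

```python
import pandas as pd

# Four rows copied from the df_breed output above (only the columns needed here).
sample = pd.DataFrame({
    "tweet_id": [887517139158093824, 887705289381826560,
                 891689557279858688, 892420643555336193],
    "p1": ["limousine", "basset", "paper_towel", "orange"],
    "p1_conf": [0.130432, 0.821664, 0.170278, 0.097049],
    "p1_dog": [False, True, False, False],
    "p2_dog": [False, True, True, False],
    "p3_dog": [False, True, False, False],
})

# True where none of the top three predictions is a dog breed.
no_dog = ~(sample.p1_dog | sample.p2_dog | sample.p3_dog)
print(sample.loc[no_dog, ["tweet_id", "p1", "p1_conf"]])
```

Rows flagged this way are candidates for a manual look at the image, as done for the tweet above.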


df_dog Quality issues:

  • datatypes: timestamp is object not datetime, tweet_id should be string
  • there are 181 retweets
  • some tweets have non-10 denominators. They may not have been programmatically extracted correctly
  • some tweets have oddly low rating numerators. They may not have been programmatically extracted correctly. Also, some tweets have decimal numerators that are not reflected in the extraction.
  • name is extracted programmatically as the word following "this is" or "here is"; sometimes this word is not a name, e.g. "a", "an", "the".
  • name is coded so that "None" is a valid name
  • some images can have multiple dog stages (multiple dogs?)
  • after removing retweets and replies, the in_reply_to... and retweet... columns can be removed
  • source, which describes how the tweet was posted, seems to be generally uninformative. The only potentially interesting contrast is if it was posted on vine, but this is indicated in the expanded url.
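The rating issues above can be probed by re-extracting the rating from the text and flagging anything unusual. A rough sketch, assuming the rating is the first "number/number" pair in the text; the second sample text is made up to illustrate a decimal numerator, the other two are quoted from tweets printed in this notebook:

```python
import pandas as pd

texts = pd.Series([
    # Quoted from Out[33] above.
    "This is an odd dog. Hard on the outside but loving on the inside. "
    "Petting still fun. Doesn't play catch well. 2/10 https://t.co/v5A4vzSDdc",
    # Made-up text, only to illustrate a decimal numerator.
    "Hypothetical tweet with a decimal rating 9.75/10",
    # Quoted from this dataset: "1/2" appears before the actual rating.
    "This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished "
    "hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv",
])

# First "number/number" pair in each text; the numerator may have a decimal part.
rating = texts.str.extract(r"(\d+(?:\.\d+)?)/(\d+)").astype(float)
rating.columns = ["numerator", "denominator"]

# Rows worth reading by hand: denominator other than 10, or a fractional numerator.
suspect = (rating.denominator != 10) | (rating.numerator % 1 != 0)
print(rating[suspect])
```

The third text shows how a non-10 denominator can arise: taking the first match picks up "1/2" rather than the rating "9/10".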

df_tweet Quality issues:

  • there are 2356 tweets in df_dog but only 2347 in df_tweet. Some tweets were deleted.

df_breed Quality issues:

  • tweet_id datatype should be string.
  • Some images are not classified as dogs (either because they are not dogs or misclassification)
  • there are 2356 tweets in df_dog but only 2075 classified images in df_breed.
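To see which tweets are missing from another table, rather than just how many, the IDs can be compared directly. A minimal sketch on stand-in tables (`archive` and `api` are hypothetical names and the IDs are made up); on the real data the tweet_id dtypes would need to match first, since df_tweet stores tweet_id as object:

```python
import pandas as pd

# Tiny stand-ins for df_dog and df_tweet; the IDs are made up for illustration.
archive = pd.DataFrame({"tweet_id": [101, 102, 103, 104]})
api = pd.DataFrame({"tweet_id": [101, 103, 104]})

# IDs present in the archive but absent from the API results.
missing = archive.loc[~archive.tweet_id.isin(api.tweet_id), "tweet_id"]
print(missing.tolist())
```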

Tidiness issues:

  • the three sources can be joined into 1 table, as all values are measured on the same unit, tweet_id.
  • dog "stages" can be combined together into one category.
  • the classification format in the breed data is technically untidy, but it may not need to be fixed in this case.
  • rating_numerator and rating_denominator can be combined into a single rating value.
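Combining the two rating columns could be as simple as dividing one by the other. A small sketch, using three ratings that appear in tweet texts in this notebook (13/10, 2/10 and 60/50); the frame name `ratings` and the new column name `rating` are assumptions:

```python
import pandas as pd

ratings = pd.DataFrame({
    "rating_numerator": [13, 2, 60],
    "rating_denominator": [10, 10, 50],
})

# A single value makes ratings with different denominators comparable.
ratings["rating"] = ratings.rating_numerator / ratings.rating_denominator
print(ratings)
```

A denominator of 0, if one exists in the data, would produce inf here and would need to be handled separately.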

Clean

Define

  • dog "stages" can be combined together into one category.

Create one dog_stage column with the value "doggo", "floofer", "pupper", "puppo", "multiple" (more than one stage), or "None", then change its data type to category.
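One possible way to build such a column is to collapse the four stage columns row by row. This is a sketch of an alternative to string concatenation, not necessarily the route taken in this notebook; `sample` and `collapse_stage` are hypothetical names:

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Three rows in the archive's encoding, where the string "None" marks "no stage".
sample = pd.DataFrame(
    [["None", "None", "None", "None"],
     ["doggo", "None", "None", "None"],
     ["doggo", "None", "pupper", "None"]],
    columns=stage_cols,
)

def collapse_stage(row):
    """Return the single stage named in a row, "multiple" if several, else "None"."""
    found = [value for value in row if value != "None"]
    if len(found) > 1:
        return "multiple"
    return found[0] if found else "None"

# Apply row-wise, then store the result as a categorical column.
sample["dog_stage"] = sample[stage_cols].apply(collapse_stage, axis=1).astype("category")
print(sample)
```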

Code

In [34]:
df_dog_clean=df_dog.copy()
df_tweet_clean=df_tweet.copy()
df_breed_clean=df_breed.copy()
In [35]:
df_dog.pupper.value_counts()
Out[35]:
None      2099
pupper     257
Name: pupper, dtype: int64
In [36]:
a=df_dog.doggo=="doggo"
b=df_dog.pupper=="pupper"

len(df_dog[a&b])
Out[36]:
12

After setting aside the 12 tweets tagged as both doggo and pupper, there should be 257 - 12 = 245 puppers.

In [37]:
df_dog_clean[df_dog_clean.tweet_id==817777686764523521]
Out[37]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
460 817777686764523521 NaN NaN 2017-01-07 16:59:28 +0000 <a href="http://twitter.com/download/iphone" r... This is Dido. She's playing the lead role in "... NaN NaN NaN https://twitter.com/dog_rates/status/817777686... 13 10 Dido doggo None pupper None
In [38]:
df_dog.iloc[:,13:].head()
Out[38]:
doggo floofer pupper puppo
0 None None None None
1 None None None None
2 None None None None
3 None None None None
4 None None None None
In [39]:
df_stage=df_dog_clean.doggo+df_dog_clean.floofer+df_dog_clean.pupper+df_dog_clean.puppo

df_stage
Out[39]:
0         NoneNoneNoneNone
1         NoneNoneNoneNone
2         NoneNoneNoneNone
3         NoneNoneNoneNone
4         NoneNoneNoneNone
5         NoneNoneNoneNone
6         NoneNoneNoneNone
7         NoneNoneNoneNone
8         NoneNoneNoneNone
9        doggoNoneNoneNone
10        NoneNoneNoneNone
11        NoneNoneNoneNone
12       NoneNoneNonepuppo
13        NoneNoneNoneNone
14       NoneNoneNonepuppo
15        NoneNoneNoneNone
16        NoneNoneNoneNone
17        NoneNoneNoneNone
18        NoneNoneNoneNone
19        NoneNoneNoneNone
20        NoneNoneNoneNone
21        NoneNoneNoneNone
22        NoneNoneNoneNone
23        NoneNoneNoneNone
24        NoneNoneNoneNone
25        NoneNoneNoneNone
26        NoneNoneNoneNone
27        NoneNoneNoneNone
28        NoneNoneNoneNone
29      NoneNonepupperNone
               ...        
2326      NoneNoneNoneNone
2327      NoneNoneNoneNone
2328      NoneNoneNoneNone
2329      NoneNoneNoneNone
2330      NoneNoneNoneNone
2331      NoneNoneNoneNone
2332      NoneNoneNoneNone
2333      NoneNoneNoneNone
2334      NoneNoneNoneNone
2335      NoneNoneNoneNone
2336      NoneNoneNoneNone
2337      NoneNoneNoneNone
2338      NoneNoneNoneNone
2339      NoneNoneNoneNone
2340      NoneNoneNoneNone
2341      NoneNoneNoneNone
2342      NoneNoneNoneNone
2343      NoneNoneNoneNone
2344      NoneNoneNoneNone
2345      NoneNoneNoneNone
2346      NoneNoneNoneNone
2347      NoneNoneNoneNone
2348      NoneNoneNoneNone
2349      NoneNoneNoneNone
2350      NoneNoneNoneNone
2351      NoneNoneNoneNone
2352      NoneNoneNoneNone
2353      NoneNoneNoneNone
2354      NoneNoneNoneNone
2355      NoneNoneNoneNone
Length: 2356, dtype: object
In [40]:
rep = {"NoneNoneNoneNone": "None",
        "doggoNoneNoneNone": "doggo",
        "NoneflooferNoneNone": "floofer",
        "NoneNonepupperNone": "pupper",
        "NoneNoneNonepuppo": "puppo"}

# Series.replace with a dict swaps whole values, so each single-stage string
# maps to its stage name and multi-stage strings are left for the next step.
df_stage=df_stage.replace(rep)

df_stage.value_counts()
Out[40]:
None                    1976
pupper                   245
doggo                     83
puppo                     29
doggoNonepupperNone       12
floofer                    9
doggoflooferNoneNone       1
doggoNoneNonepuppo         1
dtype: int64
In [41]:
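# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
df_stage=df_stage.str.replace(r'^doggo\w+','multiple',regex=True)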
df_stage.value_counts()
Out[41]:
None        1976
pupper       245
doggo         83
puppo         29
multiple      14
floofer        9
dtype: int64
In [42]:
df_dog_clean['dog_stage']=df_stage

Test

In [43]:
df_dog_clean.dog_stage.value_counts()
Out[43]:
None        1976
pupper       245
doggo         83
puppo         29
multiple      14
floofer        9
Name: dog_stage, dtype: int64
In [44]:
a=df_dog.doggo=="doggo"
b=df_dog.pupper=="pupper"

df_dog_clean[a&b].iloc[:,13:].head()
Out[44]:
doggo floofer pupper puppo dog_stage
460 doggo None pupper None multiple
531 doggo None pupper None multiple
565 doggo None pupper None multiple
575 doggo None pupper None multiple
705 doggo None pupper None multiple
In [45]:
df_dog_clean[a].iloc[:,13:].sample(10,random_state=10)
Out[45]:
doggo floofer pupper puppo dog_stage
351 doggo None None None doggo
323 doggo None None None doggo
531 doggo None pupper None multiple
877 doggo None None None doggo
448 doggo None None None doggo
108 doggo None None None doggo
624 doggo None None None doggo
1063 doggo None pupper None multiple
460 doggo None pupper None multiple
764 doggo None None None doggo
In [46]:
df_dog_clean[b].iloc[:,13:].sample(10,random_state=10)
Out[46]:
doggo floofer pupper puppo dog_stage
1811 None None pupper None pupper
575 doggo None pupper None multiple
793 None None pupper None pupper
1498 None None pupper None pupper
917 None None pupper None pupper
1243 None None pupper None pupper
453 None None pupper None pupper
1594 None None pupper None pupper
1903 None None pupper None pupper
1555 None None pupper None pupper

drop doggo, floofer, pupper and puppo columns after the test

In [47]:
df_dog_clean.drop(['doggo','floofer','pupper','puppo'],axis=1,inplace=True)

change dog_stage type to category

In [48]:
df_dog_clean.dog_stage.astype('category')
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
dog_stage                     2356 non-null object
dtypes: float64(4), int64(3), object(7)
memory usage: 257.8+ KB
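The info() output above still lists dog_stage as object: astype returns a new Series and leaves the DataFrame unchanged unless the result is assigned back. A minimal sketch of the difference on a stand-in frame (`demo` is a hypothetical name):

```python
import pandas as pd

demo = pd.DataFrame({"dog_stage": ["None", "pupper", "doggo", "multiple"]})

# The converted Series is discarded here, so the column keeps its old dtype.
demo.dog_stage.astype("category")
dtype_before = demo.dog_stage.dtype

# Assigning the result back is what actually converts the column.
demo["dog_stage"] = demo.dog_stage.astype("category")
dtype_after = demo.dog_stage.dtype

print(dtype_before, dtype_after)
```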

Clean

Define

  • the three sources can be joined into 1 table, as all values are measured on the same unit, tweet_id.

Merge the three tables on tweet_id with an inner join. Make sure tweet_id is int64 in all three tables before merging.
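The effect of the inner join and the dtype alignment can be seen on a toy example. This is only a sketch: the three small frames and their IDs and counts are made up, and `validate="one_to_one"` is an optional extra check that each tweet_id appears at most once per table:

```python
import pandas as pd

# Stand-in tables with made-up IDs; tweet_id arrives as strings in one of them.
dogs = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Phineas", "Dido", "Jax"]})
tweets = pd.DataFrame({"tweet_id": ["1", "2"], "retweet_count": [10, 20]})
breeds = pd.DataFrame({"tweet_id": [2, 3], "p1": ["basset", "Pembroke"]})

# Align key dtypes first; merging an int64 key with a string key either raises
# an error or matches nothing, depending on the pandas version.
tweets["tweet_id"] = tweets.tweet_id.astype("int64")

# An inner join keeps only the IDs present in all three tables.
merged = (dogs
          .merge(tweets, on="tweet_id", how="inner", validate="one_to_one")
          .merge(breeds, on="tweet_id", how="inner", validate="one_to_one"))
print(merged)
```

Only the ID shared by all three tables survives, which is why the merged table is smaller than any of its inputs.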

Code

In [49]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
dog_stage                     2356 non-null object
dtypes: float64(4), int64(3), object(7)
memory usage: 257.8+ KB
In [50]:
df_tweet_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2347 entries, 0 to 2346
Data columns (total 4 columns):
tweet_id           2347 non-null object
favorite_count     2347 non-null int64
retweet_count      2347 non-null int64
followers_count    2347 non-null int64
dtypes: int64(3), object(1)
memory usage: 73.4+ KB
In [51]:
df_breed_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB

change df_tweet_clean.tweet_id to int64 type

In [52]:
df_tweet_clean.tweet_id=df_tweet_clean.tweet_id.astype('int64')
df_tweet_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2347 entries, 0 to 2346
Data columns (total 4 columns):
tweet_id           2347 non-null int64
favorite_count     2347 non-null int64
retweet_count      2347 non-null int64
followers_count    2347 non-null int64
dtypes: int64(4)
memory usage: 73.4 KB
In [53]:
df_dog_clean=df_dog_clean.merge(df_tweet_clean,on='tweet_id')
In [54]:
df_dog_clean=df_dog_clean.merge(df_breed_clean, on='tweet_id')

Test

In [55]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2070 entries, 0 to 2069
Data columns (total 28 columns):
tweet_id                      2070 non-null int64
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2070 non-null object
source                        2070 non-null object
text                          2070 non-null object
retweeted_status_id           76 non-null float64
retweeted_status_user_id      76 non-null float64
retweeted_status_timestamp    76 non-null object
expanded_urls                 2070 non-null object
rating_numerator              2070 non-null int64
rating_denominator            2070 non-null int64
name                          2070 non-null object
dog_stage                     2070 non-null object
favorite_count                2070 non-null int64
retweet_count                 2070 non-null int64
followers_count               2070 non-null int64
jpg_url                       2070 non-null object
img_num                       2070 non-null int64
p1                            2070 non-null object
p1_conf                       2070 non-null float64
p1_dog                        2070 non-null bool
p2                            2070 non-null object
p2_conf                       2070 non-null float64
p2_dog                        2070 non-null bool
p3                            2070 non-null object
p3_conf                       2070 non-null float64
p3_dog                        2070 non-null bool
dtypes: bool(3), float64(7), int64(7), object(11)
memory usage: 426.5+ KB

Clean

Define

  • there are retweets and replies included in the dataset

remove any tweet that has a retweeted_status_id or an in_reply_to_status_id

Code

In [56]:
notretweets=df_dog_clean.retweeted_status_id.isnull()
notreplies=df_dog_clean.in_reply_to_status_id.isnull()
df_dog_clean=df_dog_clean[notreplies&notretweets]

Test

In [57]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1971 entries, 0 to 2069
Data columns (total 28 columns):
tweet_id                      1971 non-null int64
in_reply_to_status_id         0 non-null float64
in_reply_to_user_id           0 non-null float64
timestamp                     1971 non-null object
source                        1971 non-null object
text                          1971 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1971 non-null object
rating_numerator              1971 non-null int64
rating_denominator            1971 non-null int64
name                          1971 non-null object
dog_stage                     1971 non-null object
favorite_count                1971 non-null int64
retweet_count                 1971 non-null int64
followers_count               1971 non-null int64
jpg_url                       1971 non-null object
img_num                       1971 non-null int64
p1                            1971 non-null object
p1_conf                       1971 non-null float64
p1_dog                        1971 non-null bool
p2                            1971 non-null object
p2_conf                       1971 non-null float64
p2_dog                        1971 non-null bool
p3                            1971 non-null object
p3_conf                       1971 non-null float64
p3_dog                        1971 non-null bool
dtypes: bool(3), float64(7), int64(7), object(11)
memory usage: 406.1+ KB

Clean

Define

  • after removing retweets and replies, the in_reply_to... and retweet... columns can be removed

drop in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp

Code

In [58]:
drop_cols=['in_reply_to_status_id',
           'in_reply_to_user_id',
           'retweeted_status_id',
           'retweeted_status_user_id',
           'retweeted_status_timestamp']

df_dog_clean=df_dog_clean.drop(drop_cols,axis=1)

Test

In [59]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1971 entries, 0 to 2069
Data columns (total 23 columns):
tweet_id              1971 non-null int64
timestamp             1971 non-null object
source                1971 non-null object
text                  1971 non-null object
expanded_urls         1971 non-null object
rating_numerator      1971 non-null int64
rating_denominator    1971 non-null int64
name                  1971 non-null object
dog_stage             1971 non-null object
favorite_count        1971 non-null int64
retweet_count         1971 non-null int64
followers_count       1971 non-null int64
jpg_url               1971 non-null object
img_num               1971 non-null int64
p1                    1971 non-null object
p1_conf               1971 non-null float64
p1_dog                1971 non-null bool
p2                    1971 non-null object
p2_conf               1971 non-null float64
p2_dog                1971 non-null bool
p3                    1971 non-null object
p3_conf               1971 non-null float64
p3_dog                1971 non-null bool
dtypes: bool(3), float64(3), int64(7), object(10)
memory usage: 329.1+ KB

Clean

Define

  • name is extracted programmatically as the word following "this is"; sometimes this word is not a name, e.g. "a", "an", "the".

Re-extract names, requiring a capitalized word after template phrases that seem to precede names. Probe the text of misnamed dogs to see whether names were missed or the text simply has no name.

Code

In [60]:
df_dog_clean.name.value_counts()
Out[60]:
None           524
a               55
Charlie         11
Lucy            10
Oliver          10
Cooper          10
Tucker           9
Penny            9
Sadie            8
Winston          8
Lola             7
Daisy            7
the              7
Toby             7
Stanley          6
Bo               6
an               6
Bella            6
Koda             6
Jax              6
Leo              5
Bailey           5
Buddy            5
Milo             5
Louis            5
Chester          5
Oscar            5
Dave             5
Rusty            5
Scout            5
              ... 
Kayla            1
Rizzy            1
Devón            1
Snoop            1
Tupawc           1
Heinrich         1
Brockly          1
Shnuggles        1
Flash            1
Alejandro        1
Carter           1
Kathmandu        1
infuriating      1
Jockson          1
Holly            1
Mattie           1
Deacon           1
Bobb             1
Schnozz          1
Brandi           1
Shaggy           1
Sierra           1
Clarq            1
Crumpet          1
Iggy             1
Dotsy            1
Stormy           1
Bluebert         1
Dietrich         1
Snoopy           1
Name: name, Length: 935, dtype: int64
In [61]:
df_dog_clean.text[df_dog_clean.name=='a']
Out[61]:
50      Here is a pupper approaching maximum borkdrive...
520     Here is a perfect example of someone who has t...
643     Guys this is getting so out of hand. We only r...
819     This is a mighty rare blue-tailed hammer sherk...
821     Viewer discretion is advised. This is a terrib...
830     This is a carrot. We only rate dogs. Please on...
856     This is a very rare Great Alaskan Bush Pupper....
992     People please. This is a Deadly Mediterranean ...
1002    This is a taco. We only rate dogs. Please only...
1119    Here is a heartbreaking scene of an incredible...
1128    Here is a whole flock of puppers.  60/50 I'll ...
1138    This is a Butternut Cumberfloof. It's not wind...
1144    This is a Wild Tuscan Poofwiggle. Careful not ...
1156    "Pupper is a present to world. Here is a bow f...
1259    This is a rare Arctic Wubberfloof. Unamused by...
1472    Guys this really needs to stop. We've been ove...
1515    This is a dog swinging. I really enjoyed it so...
1577    This is a Sizzlin Menorah spaniel from Brookly...
1578    Seriously guys?! Only send in dogs. I only rat...
1601    C'mon guys. We've been over this. We only rate...
1602    This is a fluffy albino Bacardi Columbia mix. ...
1643    This is a Sagitariot Baklava mix. Loves her ne...
1660    This is a heavily opinionated dog. Loves walls...
1674    This is a Lofted Aphrodisiac Terrier named Kip...
1713    This is a baby Rand Paul. Curls for days. 11/1...
1753    This is a Tuscaloosa Alcatraz named Jacob (Yac...
1784    This is a Helvetica Listerine named Rufus. Thi...
1834    This is a Deciduous Trimester mix named Spork....
1843    This is a Rich Mahogany Seltzer named Cherokee...
1846    This is a Speckled Cauliflower Yosemite named ...
1864    This is a spotted Lipitor Rumpelstiltskin name...
1870    This is a brave dog. Excellent free climber. T...
1878    This is a Coriander Baton Rouge named Alfredo....
1907    This is a Slovakian Helter Skelter Feta named ...
1914    This is a wild Toblerone from Papua New Guinea...
1927    Here is a horned dog. Much grace. Can jump ove...
1933    This is a Birmingham Quagmire named Chuk. Love...
1937    Here is a mother dog caring for her pups. Snaz...
1950    This is a Trans Siberian Kellogg named Alfonso...
1964    This is a Shotokon Macadamia mix named Cheryl....
1970    This is a rare Hungarian Pinot named Jessiga. ...
1979    This is a southwest Coriander named Klint. Hat...
1988    This is a northern Wahoo named Kohl. He runs t...
2002    This is a Dasani Kingfisher from Maine. His na...
2018    This is a curly Ticonderoga named Pepe. No fee...
2025    This is a purebred Bacardi named Octaviath. Ca...
2028    This is a golden Buckminsterfullerene named Jo...
2041    This is a southern Vesuvius bumblegruff. Can d...
2048    This is a funny dog. Weird toes. Won't come do...
2061    My oh my. This is a rare blond Canadian terrie...
2062    Here is a Siberian heavily armored polar bear ...
2064    This is a truly beautiful English Wilson Staff...
2066    This is a purebred Piers Morgan. Loves to Netf...
2067    Here is a very happy pup. Big fan of well-main...
2068    This is a western brown Mitsubishi terrier. Up...
Name: text, dtype: object

The base extraction seems to use "This is \w+" or "Here is \w+" (or a similar template), but it produces a lot of false positives. Here we see that the dog name often does not come immediately after "This is ". One thing the regex extraction should rely on is that dog names are always capitalized. This will get rid of many false positives.

Also, looking at the list of false positives, some of the texts do contain names. Two more key phrases that may pick up names are shown in these tweets:

In [62]:
df_dog_clean.text[1979]
Out[62]:
'This is a southwest Coriander named Klint. Hat looks expensive. Still on house arrest :(\n9/10 https://t.co/IQTOMqDUIe'
In [63]:
df_dog_clean.text[2002]
Out[63]:
"This is a Dasani Kingfisher from Maine. His name is Daryl. Daryl doesn't like being swallowed by a panda. 8/10 https://t.co/jpaeu6LNmW"

It makes sense that "name is" or "named" may precede the name of the dog, so they should be included in the extraction.
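The idea can be checked on the tweets quoted above with Python's re module before applying it to the whole column. A small sketch; `NAME_PATTERN` and `extract_name` are hypothetical names, and the third text is a tweet from this dataset that contains no name:

```python
import re

# A template phrase followed by a capitalized word of at least two characters.
NAME_PATTERN = re.compile(
    r"(?:[Tt]his is |named |name is |[Hh]ere is )([A-Z][\w']+)"
)

def extract_name(text):
    """Return the first capitalized word after a template phrase, or None."""
    match = NAME_PATTERN.search(text)
    return match.group(1) if match else None

examples = [
    "This is a southwest Coriander named Klint. Hat looks expensive. "
    "Still on house arrest :(\n9/10 https://t.co/IQTOMqDUIe",
    "This is a Dasani Kingfisher from Maine. His name is Daryl. Daryl doesn't "
    "like being swallowed by a panda. 8/10 https://t.co/jpaeu6LNmW",
    "This is a funny dog. Weird toes. Won't come down. Loves branch. "
    "Refuses to eat his food. Hard to cuddle with. 3/10 https://t.co/IIXis0zta0",
]
names = [extract_name(text) for text in examples]
print(names)
```

Because "a" is lowercase, "This is a ..." no longer yields a false positive, and the search moves on to "named" or "name is".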

In [64]:
# (?: ) is a non-capturing group, needed to group the alternatives;
# a raw string keeps \w from being treated as a Python string escape
names=df_dog_clean.text.str.extract(r"(?:[Tt]his is |named |name is |[Hh]ere is )([A-Z][\w']+)",expand=True)
names[0].value_counts().head(10)
Out[64]:
Cooper     9
Lucy       9
Charlie    8
Penny      8
Tucker     8
Oliver     8
Bo         6
Bella      6
Zoey       5
Lola       5
Name: 0, dtype: int64
In [65]:
sum(names[0].isnull())
Out[65]:
898

Next I need to probe the text of tweets that did not have a name extracted; there may be other key phrases that were missed. I can iterate over the names with new templates if necessary.
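One way to make that iteration cheap is to keep the template phrases in a list and build the pattern from it, so a new phrase is a one-line change. A sketch only: "Meet " and "Say hello to " are assumed additions for illustration, and the sample texts are quoted from tweets in this dataset:

```python
import re

# Template phrases that may precede a name. The first four are the ones
# discussed so far; "Meet " and "Say hello to " are assumed additions.
templates = ["[Tt]his is ", "named ", "name is ", "[Hh]ere is ",
             "Meet ", "Say hello to "]

# Join the phrases into one non-capturing group, then capture the name.
pattern = re.compile("(?:" + "|".join(templates) + r")([A-Z][\w']+)")

def extract_name(text):
    """Return the first capitalized word after any template phrase, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None

tweets = [
    "Meet Cash. He hath acquired a stick. A very good stick tbh. "
    "12/10 would pat head approvingly https://t.co/lZhtizkURD",
    "Say hello to Lassie. She's celebrating #PrideMonth by being a splendid "
    "mix of astute and adorable. Proudly supupporting her owner. "
    "13/10 https://t.co/uK6PNyeh9w",
    "Can take selfies 11/10 https://t.co/ws2AMaNwPW",
]
found = [extract_name(text) for text in tweets]
print(found)
```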

In [66]:
# None removes the column-width limit (pandas >= 1.0; older versions used -1)
with pd.option_context('display.max_colwidth',None):
    print(df_dog_clean.text[names[0].isnull()])
5       Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh    
6       Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\n\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl
7       When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq                        
12      Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm                                    
21      I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba                                  
23      You may not have known you needed to see this today. 13/10 please enjoy (IG: emmylouroo) https://t.co/WZqNqygEyV                                                      
24      This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https://t.co/4Ad1jzJSdp             
33      Here we have a corgi undercover as a malamute. Pawbably doing important investigative work. Zero control over tongue happenings. 13/10 https://t.co/44ItaMubBf        
37      I present to you, Pup in Hat. Pup in Hat is great for all occasions. Extremely versatile. Compact as h*ck. 14/10 (IG: itselizabethgales) https://t.co/vvBOcC2VdC      
38      Meet Yogi. He doesn't have any important dog meetings today he just enjoys looking his best at all times. 12/10 for dangerously dapper doggo https://t.co/YSI00BzTBZ  
41      Meet Grizzwald. He may be the floofiest floofer I ever did see. Lost eyes saving a schoolbus from a volcano erpuption. 13/10 heroic as h*ck https://t.co/rf661IFEYP   
42      Please only send dogs. We don't rate mechanics, no matter how h*ckin good. Thank you... 13/10 would sneak a pat https://t.co/Se5fZ9wp5E                               
50      Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF           
51      Meet Elliot. He's a Canadian Forrest Pup. Unusual number of antlers for a dog. Sneaky tongue slip to celebrate #Canada150. 12/10 would pet https://t.co/cgwJwowTMC    
53      Ugh not again. We only rate dogs. Please don't send in well-dressed  floppy-tongued street penguins. Dogs only please. Thank you... 12/10 https://t.co/WiAMbTkDPf     
55      Meet Jesse. He's a Fetty Woof. His tongue ejects without warning. A true bleptomaniac. 12/10 would snug well https://t.co/fUod0tVmvK                                  
56      Please don't send in photos without dogs in them. We're not @porch_rates. Insubordinate and churlish. Pretty good porch tho 11/10 https://t.co/HauE8M3Bu4             
64      Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps://t.co/245xJJElsY https://t.co/lUiQH219v6
69      Meet Dante. At first he wasn't a fan of his new raincoat, then he saw his reflection. H*ckin handsome. 13/10 for water resistant good boy https://t.co/SHRTIo5pxc     
73      Meet Venti, a seemingly caffeinated puppoccino. She was just informed the weekend would include walks, pats and scritches. 13/10 much excite https://t.co/ejExJFq3ek  
75      Meet Nugget and Hank. Nugget took Hank's bone. Hank is wondering if you would please return it to him. Both 13/10 would not intervene https://t.co/ogith9ejNj         
76      Guys please stop sending pictures without any dogs in th- oh never mind hello excuse me sir. 12/10 stealthy as h*ck https://t.co/brCQoqc8AW                           
77      Meet Cash. He hath acquired a stick. A very good stick tbh. 12/10 would pat head approvingly https://t.co/lZhtizkURD                                                  
79      I can't believe this keeps happening. This, is a birb taking a bath. We only rate dogs. Please only send dogs. Thank you... 12/10 https://t.co/pwY9PQhtP2             
81      We usually don't rate Deck-bound Saskatoon Black Bears, but this one is h*ckin flawless. Sneaky tongue slip too. 13/10 would hug firmly https://t.co/mNuMH9400n       
83      Here's a very large dog. He has a date later. Politely asked this water person to check if his breath is bad. 12/10 good to go doggo https://t.co/EMYIdoblMR          
84      Here are my favorite #dogsatpollingstations \nMost voted for a more consistent walking schedule and to increase daily pats tenfold. All 13/10 https://t.co/17FVMl4VZ5 
86      We. Only. Rate. Dogs. Do not send in other things like this fluffy floor shark clearly ready to attack. Get it together guys... 12/10 https://t.co/BZHiKx3FpQ         
89      Say hello to Lassie. She's celebrating #PrideMonth by being a splendid mix of astute and adorable. Proudly supupporting her owner. 13/10 https://t.co/uK6PNyeh9w      
93      Real funny guys. Sending in a pic without a dog in it. Hilarious. We'll rate the rug tho because it's giving off a very good vibe. 11/10 https://t.co/GCD1JccCyi      
                                                                                      ...                                                                                     
2040    This is quite the dog. Gets really excited when not in water. Not very soft tho. Bad at fetch. Can't do tricks. 2/10 https://t.co/aMCTNWO94t                          
2041    This is a southern Vesuvius bumblegruff. Can drive a truck (wow). Made friends with 5 other nifty dogs (neat). 7/10 https://t.co/LopTBkKa8h                           
2042    Oh goodness. A super rare northeast Qdoba kangaroo mix. Massive feet. No pouch (disappointing). Seems alert. 9/10 https://t.co/Dc7b0E8qFE                             
2043    Those are sunglasses and a jean jacket. 11/10 dog cool af https://t.co/uHXrPkUEyl                                                                                     
2044    Unique dog here. Very small. Lives in container of Frosted Flakes (?). Short legs. Must be rare 6/10 would still pet https://t.co/XMD9CwjEnM                          
2045    Here we have a mixed Asiago from the Galápagos Islands. Only one ear working. Big fan of marijuana carpet. 8/10 https://t.co/tltQ5w9aUO                               
2046    Look at this jokester thinking seat belt laws don't apply to him. Great tongue tho 10/10 https://t.co/VFKG1vxGjB                                                      
2047    This is an extremely rare horned Parthenon. Not amused. Wears shoes. Overall very nice. 9/10 would pet aggressively https://t.co/QpRjllzWAL                           
2048    This is a funny dog. Weird toes. Won't come down. Loves branch. Refuses to eat his food. Hard to cuddle with. 3/10 https://t.co/IIXis0zta0                            
2049    This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv                              
2050    Can take selfies 11/10 https://t.co/ws2AMaNwPW                                                                                                                        
2051    Very concerned about fellow dog trapped in computer. 10/10 https://t.co/0yxApIikpk                                                                                    
2052    Not familiar with this breed. No tail (weird). Only 2 legs. Doesn't bark. Surprisingly quick. Shits eggs. 1/10 https://t.co/Asgdc6kuLX                                
2053    Oh my. Here you are seeing an Adobe Setter giving birth to twins!!! The world is an amazing place. 11/10 https://t.co/11LvqN4WLq                                      
2054    Can stand on stump for what seems like a while. Built that birdhouse? Impressive. Made friends with a squirrel. 8/10 https://t.co/Ri4nMTLq5C                          
2055    This appears to be a Mongolian Presbyterian mix. Very tired. Tongue slip confirmed. 9/10 would lie down with https://t.co/mnioXo3IfP                                  
2056    Here we have a well-established sunblockerspaniel. Lost his other flip-flop. 6/10 not very waterproof https://t.co/3RU6x0vHB7                                         
2057    Let's hope this flight isn't Malaysian (lol). What a dog! Almost completely camouflaged. 10/10 I trust this pilot https://t.co/Yk6GHE9tOY                             
2058    Here we have a northern speckled Rhododendron. Much sass. Gives 0 fucks. Good tongue. 9/10 would caress sensually https://t.co/ZoL8kq2XFx                             
2059    This is the happiest dog you will ever see. Very committed owner. Nice couch. 10/10 https://t.co/RhUEAloehK                                                           
2060    Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p                               
2061    My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O                                          
2062    Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt                          
2063    This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 2/10 https://t.co/v5A4vzSDdc                            
2064    This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe                          
2065    Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq                                              
2066    This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx                             
2067    Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR                                    
2068    This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI                           
2069    Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj                                   
Name: text, Length: 898, dtype: object

A big prefix I was missing is "Meet ", which makes sense. "Say hello to" also pops up frequently.

In [67]:
names=df_dog_clean.text.str.extract(r"(?:[Tt]his is |[Mm]eet |hello to |named |name is |[Hh]ere is )([A-Z][\w']+)",expand=True)
names[0].value_counts().head(10)
Out[67]:
Charlie    11
Cooper     10
Oliver     10
Lucy       10
Tucker      9
Penny       9
Winston     8
Sadie       8
Toby        7
Lola        7
Name: 0, dtype: int64
In [68]:
sum(names[0].isnull())
Out[68]:
591
In [69]:
with pd.option_context('display.max_colwidth',-1):
    print(df_dog_clean.text[names[0].isnull()])
5       Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh   
7       When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq                       
12      Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm                                   
21      I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba                                 
23      You may not have known you needed to see this today. 13/10 please enjoy (IG: emmylouroo) https://t.co/WZqNqygEyV                                                     
24      This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https://t.co/4Ad1jzJSdp            
33      Here we have a corgi undercover as a malamute. Pawbably doing important investigative work. Zero control over tongue happenings. 13/10 https://t.co/44ItaMubBf       
37      I present to you, Pup in Hat. Pup in Hat is great for all occasions. Extremely versatile. Compact as h*ck. 14/10 (IG: itselizabethgales) https://t.co/vvBOcC2VdC     
42      Please only send dogs. We don't rate mechanics, no matter how h*ckin good. Thank you... 13/10 would sneak a pat https://t.co/Se5fZ9wp5E                              
50      Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF          
53      Ugh not again. We only rate dogs. Please don't send in well-dressed  floppy-tongued street penguins. Dogs only please. Thank you... 12/10 https://t.co/WiAMbTkDPf    
56      Please don't send in photos without dogs in them. We're not @porch_rates. Insubordinate and churlish. Pretty good porch tho 11/10 https://t.co/HauE8M3Bu4            
76      Guys please stop sending pictures without any dogs in th- oh never mind hello excuse me sir. 12/10 stealthy as h*ck https://t.co/brCQoqc8AW                          
79      I can't believe this keeps happening. This, is a birb taking a bath. We only rate dogs. Please only send dogs. Thank you... 12/10 https://t.co/pwY9PQhtP2            
81      We usually don't rate Deck-bound Saskatoon Black Bears, but this one is h*ckin flawless. Sneaky tongue slip too. 13/10 would hug firmly https://t.co/mNuMH9400n      
83      Here's a very large dog. He has a date later. Politely asked this water person to check if his breath is bad. 12/10 good to go doggo https://t.co/EMYIdoblMR         
84      Here are my favorite #dogsatpollingstations \nMost voted for a more consistent walking schedule and to increase daily pats tenfold. All 13/10 https://t.co/17FVMl4VZ5
86      We. Only. Rate. Dogs. Do not send in other things like this fluffy floor shark clearly ready to attack. Get it together guys... 12/10 https://t.co/BZHiKx3FpQ        
93      Real funny guys. Sending in a pic without a dog in it. Hilarious. We'll rate the rug tho because it's giving off a very good vibe. 11/10 https://t.co/GCD1JccCyi     
103     Here's a h*ckin peaceful boy. Unbothered by the comings and goings. 13/10 please reveal your wise ways https://t.co/yeaH8Ej5eM                                       
105     Unbelievable. We only rate dogs. Please don't send in non-canines like the "I" from Pixar's opening credits. Thank you... 12/10 https://t.co/JMhDNv5wXZ              
109     Oh my this spooked me up. We only rate dogs, not happy ghosts. Please send dogs only. It's a very simple premise. Thank you... 13/10 https://t.co/M5Rz0R8SIQ         
116     We only rate dogs. Please don't send in Jesus. We're trying to remain professional and legitimate. Thank you... 14/10 https://t.co/wr3xsjeCIR                        
127     We only rate dogs. Please don't send perfectly toasted marshmallows attempting to drive. Thank you... 13/10 https://t.co/nvZyyrp0kd                                  
129     HI. MY. NAME. IS. BOOMER. AND. I. WANT. TO. SAY. IT'S. H*CKIN. RIDICULOUS. THAT. DOGS. CAN'T VOTE. ABSOLUTE. CODSWALLUP. THANK. YOU. 13/10 https://t.co/SqKJPwbQ2g   
135     Here we have perhaps the wisest dog of all. Above average with light sabers. Immortal as h*ck. 14/10 dog, or dog not, there is no try https://t.co/upRYxG4KbG        
139     We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10 https://t.co/g2nSyGenG9                 
140     Here we have an exotic dog. Good at ukulele. Fashionable af. Has two more arms if needed. Is blue. Knows what 'ohana means. 13/10 would pet https://t.co/gEsymGTXCT  
141     I have stumbled puppon a doggo painting party. They're looking to be the next Pupcasso or Puppollock. All 13/10 would put it on the fridge https://t.co/cUeDMlHJbq   
146     Instead of the usual nightly dog rate, I'm sharing this story with you. Meeko is 13/10 and would like your help \n\nhttps://t.co/Mj4j6QoIJk https://t.co/JdNE5oqYEV  
                                                                                       ...                                                                                   
2040    This is quite the dog. Gets really excited when not in water. Not very soft tho. Bad at fetch. Can't do tricks. 2/10 https://t.co/aMCTNWO94t                         
2041    This is a southern Vesuvius bumblegruff. Can drive a truck (wow). Made friends with 5 other nifty dogs (neat). 7/10 https://t.co/LopTBkKa8h                          
2042    Oh goodness. A super rare northeast Qdoba kangaroo mix. Massive feet. No pouch (disappointing). Seems alert. 9/10 https://t.co/Dc7b0E8qFE                            
2043    Those are sunglasses and a jean jacket. 11/10 dog cool af https://t.co/uHXrPkUEyl                                                                                    
2044    Unique dog here. Very small. Lives in container of Frosted Flakes (?). Short legs. Must be rare 6/10 would still pet https://t.co/XMD9CwjEnM                         
2045    Here we have a mixed Asiago from the Galápagos Islands. Only one ear working. Big fan of marijuana carpet. 8/10 https://t.co/tltQ5w9aUO                              
2046    Look at this jokester thinking seat belt laws don't apply to him. Great tongue tho 10/10 https://t.co/VFKG1vxGjB                                                     
2047    This is an extremely rare horned Parthenon. Not amused. Wears shoes. Overall very nice. 9/10 would pet aggressively https://t.co/QpRjllzWAL                          
2048    This is a funny dog. Weird toes. Won't come down. Loves branch. Refuses to eat his food. Hard to cuddle with. 3/10 https://t.co/IIXis0zta0                           
2049    This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv                             
2050    Can take selfies 11/10 https://t.co/ws2AMaNwPW                                                                                                                       
2051    Very concerned about fellow dog trapped in computer. 10/10 https://t.co/0yxApIikpk                                                                                   
2052    Not familiar with this breed. No tail (weird). Only 2 legs. Doesn't bark. Surprisingly quick. Shits eggs. 1/10 https://t.co/Asgdc6kuLX                               
2053    Oh my. Here you are seeing an Adobe Setter giving birth to twins!!! The world is an amazing place. 11/10 https://t.co/11LvqN4WLq                                     
2054    Can stand on stump for what seems like a while. Built that birdhouse? Impressive. Made friends with a squirrel. 8/10 https://t.co/Ri4nMTLq5C                         
2055    This appears to be a Mongolian Presbyterian mix. Very tired. Tongue slip confirmed. 9/10 would lie down with https://t.co/mnioXo3IfP                                 
2056    Here we have a well-established sunblockerspaniel. Lost his other flip-flop. 6/10 not very waterproof https://t.co/3RU6x0vHB7                                        
2057    Let's hope this flight isn't Malaysian (lol). What a dog! Almost completely camouflaged. 10/10 I trust this pilot https://t.co/Yk6GHE9tOY                            
2058    Here we have a northern speckled Rhododendron. Much sass. Gives 0 fucks. Good tongue. 9/10 would caress sensually https://t.co/ZoL8kq2XFx                            
2059    This is the happiest dog you will ever see. Very committed owner. Nice couch. 10/10 https://t.co/RhUEAloehK                                                          
2060    Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p                              
2061    My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O                                         
2062    Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt                         
2063    This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 2/10 https://t.co/v5A4vzSDdc                           
2064    This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe                         
2065    Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq                                             
2066    This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx                            
2067    Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR                                   
2068    This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI                          
2069    Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj                                  
Name: text, Length: 591, dtype: object

There are indeed still some missed dogs, but it would be hard to capture them all without increasing false name extractions. Owner names, holidays, and famous people's names are capitalized, and the tweets also invent breed names for dogs that are capitalized. One idea would be to use the previously extracted names as a lookup table to match names in the tweets that have no extracted name. However, since many dogs have human names, it would still require some manual cleanup. Here we can also see some examples of how images with multiple dogs (and dog names) complicate things. Within the scope of this project, the current extraction seems to do pretty well. There are only 591 tweets without names, and most of them truly have no name.
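
The lookup-table idea could be sketched roughly as follows. This is a minimal illustration on plain Python strings rather than on the notebook's DataFrame; the helper `find_known_name`, the small `known_names` set, and the first sample sentence are hypothetical and only serve to show the mechanism.

```python
import re

def find_known_name(text, known_names):
    """Return the first capitalized word in text that is a known dog name, else None.

    Hypothetical helper, not part of the notebook.
    """
    for word in re.findall(r"\b[A-Z][\w']+", text):
        if word in known_names:
            return word
    return None

# A few names already captured by the prefix-based regex (taken from the counts above)
known_names = {"Charlie", "Cooper", "Tucker", "Bella"}

# Illustrative sentence (made up) that the prefix-based regex would miss
print(find_known_name("Instead of the usual rate, I'm sharing Tucker with you.", known_names))

# A tweet without any known name stays unnamed
print(find_known_name("We only rate dogs. Please only send dogs.", known_names))
```

Because the match is a plain set lookup, an owner or celebrity who shares a name with a known dog would also be returned, which is why manual review would still be needed.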

In [70]:
df_dog_clean['name']=names

Test

In [71]:
df_dog_clean.name.value_counts()
Out[71]:
Charlie      11
Cooper       10
Oliver       10
Lucy         10
Tucker        9
Penny         9
Winston       8
Sadie         8
Toby          7
Lola          7
Daisy         7
Bo            6
Bella         6
Stanley       6
Jax           6
Koda          6
Chester       5
Bailey        5
Oscar         5
Zoey          5
Milo          5
Scout         5
Rusty         5
Buddy         5
Leo           5
Louis         5
Dave          5
Gary          4
Gus           4
Jack          4
             ..
Rontu         1
Gert          1
Kayla         1
Rizzy         1
Dietrich      1
Bluebert      1
Stormy        1
Dotsy         1
Stark         1
Sora          1
Hector        1
Remy          1
Doobert       1
Rover         1
Alejandro     1
Carter        1
Kathmandu     1
Lizzie        1
Jockson       1
Holly         1
Mattie        1
Deacon        1
Bobb          1
Leonidas      1
Brandi        1
Sierra        1
Clarq         1
Crumpet       1
Iggy          1
Snoopy        1
Name: name, Length: 936, dtype: int64
In [72]:
df_dog_clean[['name','text']].sample(15,random_state=10)
Out[72]:
name text
1249 NaN Please only send in dogs. Don't submit other t...
771 NaN Here's a doggo trying to catch some fish. 8/10...
1129 NaN "YOU CAN'T HANDLE THE TRUTH" both 10/10 https:...
1696 Schnozz This is Schnozz. He's had a blurred tail since...
1131 Bella This is Bella. Based on this picture she's at ...
906 Buckley Meet Buckley. His family &amp; some neighbors ...
1028 NaN I can't even comprehend how confused this dog ...
1963 Shaggy This is Shaggy. He knows exactly how to solve ...
1516 Sandy This is Sandy. He's sexually confused. Thinks ...
1782 Holly Meet Holly. She's trying to teach small human-...
185 Georgie This is Georgie. He's very shy. Only puppears ...
1753 Jacob This is a Tuscaloosa Alcatraz named Jacob (Yac...
71 Ginger This is Ginger. She's having a ruff Monday. To...
98 Dewey This is Dewey (pronounced "covfefe"). He's hav...
402 Bauer This is Bauer. He had nothing to do with the c...
In [73]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1971 entries, 0 to 2069
Data columns (total 23 columns):
tweet_id              1971 non-null int64
timestamp             1971 non-null object
source                1971 non-null object
text                  1971 non-null object
expanded_urls         1971 non-null object
rating_numerator      1971 non-null int64
rating_denominator    1971 non-null int64
name                  1380 non-null object
dog_stage             1971 non-null object
favorite_count        1971 non-null int64
retweet_count         1971 non-null int64
followers_count       1971 non-null int64
jpg_url               1971 non-null object
img_num               1971 non-null int64
p1                    1971 non-null object
p1_conf               1971 non-null float64
p1_dog                1971 non-null bool
p2                    1971 non-null object
p2_conf               1971 non-null float64
p2_dog                1971 non-null bool
p3                    1971 non-null object
p3_conf               1971 non-null float64
p3_dog                1971 non-null bool
dtypes: bool(3), float64(3), int64(7), object(10)
memory usage: 409.1+ KB

Clean

Define

  • some tweets have non-10 denominators; they may not have been programmatically extracted correctly

Probe tweets with non-10 denominators to see if there is some common reason, or if the rating was not correctly extracted.

Modify rating extraction if necessary, and exclude tweets that have no ratings or strange ratings if necessary.

Code

In [74]:
df_dog_clean.rating_denominator.value_counts()
Out[74]:
10     1954
50        3
80        2
11        2
170       1
150       1
120       1
110       1
90        1
70        1
40        1
20        1
7         1
2         1
Name: rating_denominator, dtype: int64
In [75]:
list(df_dog_clean[df_dog_clean.rating_denominator!=10].text)
Out[75]:
['The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd',
 'Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx',
 'Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE',
 'After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ',
 'Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv',
 'Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a',
 'This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq',
 "Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1",
 "Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12",
 'From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK',
 "Here is a whole flock of puppers.  60/50 I'll take the lot https://t.co/9dpcw6MdWa",
 "Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ",
 'Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55',
 'This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5',
 "IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq",
 'Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw',
 'This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv']

Tricky. Sometimes the wrong thing was extracted (24/7, 9/11, 7/11, 4/20, 50/50, 1/2), and sometimes the rating system is modified to sum the ratings across multiple dogs (45/50 = 9/10 for 5 dogs). It seems that real ratings always have a denominator that is divisible by 10, so we can build a pattern that excludes common fractions whose denominators are not divisible by 10.

This will take care of 24/7, 9/11, 7/11, and 1/2. 4/20 and 50/50 need more care: their denominators are divisible by 10, and they could in principle be valid ratings, especially 50/50.

The proper rating also seems to always be the last fraction in the text, which may be leveraged.
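
As a rough sketch of how these two observations could be combined, the snippet below keeps only fractions whose denominator ends in 0 and then takes the last one in the text. The pattern `RATING` and the helper `last_rating` are assumptions made for illustration, not the notebook's own code, and the sample strings are tweets listed above (the third is shortened).

```python
import re

# Assumed pattern (not the notebook's): digits, a slash, then a denominator
# ending in 0. Requiring digits before the slash also skips "t.co/..." links.
RATING = re.compile(r"(\d+)/(\d+0)\b")

def last_rating(text):
    """Return (numerator, denominator) of the last matching fraction, or None."""
    matches = RATING.findall(text)
    if not matches:
        return None
    numerator, denominator = matches[-1]
    return int(numerator), int(denominator)

samples = [
    "Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a",
    "This is Darrel. He just robbed a 7/11 and is in a high speed police chase. "
    "Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5",
    "Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer.",
]
for s in samples:
    print(last_rating(s), "<-", s[:30])
```

A tweet with no usable fraction, such as the 24/7 one, yields `None`. Note that 4/20 and 50/50 still match the pattern; they are only avoided here because they are not the last fraction in their tweets.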

Potentially, we can additionally normalize ratings to be out of 10.
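
A minimal sketch of that normalization, assuming a simple linear rescaling; the function name `normalize_rating` is hypothetical. In the notebook the same arithmetic would presumably be applied column-wise to `rating_numerator` and `rating_denominator`.

```python
def normalize_rating(numerator, denominator):
    """Rescale a rating so that it is expressed out of 10 (assumed linear rescaling)."""
    return numerator * 10 / denominator

# Group ratings from the tweets above
print(normalize_rating(45, 50))    # five dogs rated 45/50 in total
print(normalize_rating(165, 150))
```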

An important thing to note is that the tweet with 24/7 in it actually has no rating at all, and should be dropped.

In [76]:
df_dog_clean[df_dog_clean.rating_denominator!=10].text
Out[76]:
342     The floofs have been released I repeat the flo...
412     Meet Sam. She smiles 24/7 &amp; secretly aspir...
731     Why does this never happen at my front door......
873     After so many requests, this is Bretagne. She ...
921     Say hello to this unbelievably well behaved sq...
964     Happy 4/20 from the squad! 13/10 for all https...
998     This is Bluebert. He just saw that both #Final...
1019    Happy Saturday here's 9 puppers on a bench. 99...
1044    Here's a brigade of puppers. All look very pre...
1062    From left to right:\nCletus, Jerome, Alejandro...
1128    Here is a whole flock of puppers.  60/50 I'll ...
1204    Happy Wednesday here's a bucket of pups. 44/40...
1377    Someone help the girl is being mugged. Several...
1402    This is Darrel. He just robbed a 7/11 and is i...
1509    IT'S PUPPERGEDDON. Total of 144/120 ...I think...
1568    Here we have an entire platoon of puppers. Tot...
2049    This is an Albanian 3 1/2 legged  Episcopalian...
Name: text, dtype: object
In [77]:
# drop the tweet with 24/7 as it has no rating
df_dog_clean=df_dog_clean.drop(index=412)
df_dog_clean[df_dog_clean.rating_denominator!=10].text
Out[77]:
342     The floofs have been released I repeat the flo...
731     Why does this never happen at my front door......
873     After so many requests, this is Bretagne. She ...
921     Say hello to this unbelievably well behaved sq...
964     Happy 4/20 from the squad! 13/10 for all https...
998     This is Bluebert. He just saw that both #Final...
1019    Happy Saturday here's 9 puppers on a bench. 99...
1044    Here's a brigade of puppers. All look very pre...
1062    From left to right:\nCletus, Jerome, Alejandro...
1128    Here is a whole flock of puppers.  60/50 I'll ...
1204    Happy Wednesday here's a bucket of pups. 44/40...
1377    Someone help the girl is being mugged. Several...
1402    This is Darrel. He just robbed a 7/11 and is i...
1509    IT'S PUPPERGEDDON. Total of 144/120 ...I think...
1568    Here we have an entire platoon of puppers. Tot...
2049    This is an Albanian 3 1/2 legged  Episcopalian...
Name: text, dtype: object

Iterate using extractall a few times, checking the match levels to find examples where more than one denominator is extracted.

In [78]:
denominator=df_dog_clean.text.str.extractall(r'/(\d+0)')
denominator.xs(1,level='match').head()
Out[78]:
0
21 20
48 50
612 10
781 510
822 10
In [79]:
df_dog_clean.loc[21].text
Out[79]:
"I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba"

We need to avoid extracting digits from the URLs: the 20 above comes from the link https://t.co/20VrLAA8ba, not from a rating.
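
The problem and two possible ways around it can be checked on the tweet at index 21 with the standard `re` module. The first option mirrors the `[^o]` guard used in the next cell; the second, stripping links before matching, is an alternative assumed here for comparison and is not what the notebook does.

```python
import re

text = ("I've yet to rate a Venezuelan Hover Wiener. This is such an honor. "
        "14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba")

# Unrestricted pattern: also picks up "20" from the shortened link
unrestricted = re.findall(r"/(\d+0)", text)
print(unrestricted)

# Option 1: require a character other than "o" before the slash,
# which skips anything that directly follows "t.co/"
guarded = re.findall(r"[^o]/(\d+0)", text)
print(guarded)

# Option 2 (assumption, not the notebook's approach): remove links first
no_urls = re.sub(r"https?://\S+", "", text)
stripped = re.findall(r"/(\d+0)", no_urls)
print(stripped)
```

The guard is tied to the `t.co` shortener, so it would not help with links from other domains, whereas stripping URLs does not depend on the domain.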

In [80]:
denominator=df_dog_clean.text.str.extractall(r'[^o]/(\d+0)')
denominator.xs(0,level='match').head()
# denominator
Out[80]:
0
0 10
1 10
2 10
3 10
4 10
In [81]:
manual_rating_clean=denominator.xs(1,level='match').index.tolist()
list(df_dog_clean.loc[manual_rating_clean].text)
Out[81]:
['"Yep... just as I suspected. You\'re not flossing." 12/10 and 11/10 for the pup not flossing https://t.co/SuXcI9B7pQ',
 'This is Bookstore and Seaweed. Bookstore is tired and Seaweed is an asshole. 10/10 and 7/10 respectively https://t.co/eUGjGjjFVJ',
 'Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a',
 'This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq',
 "Meet Travis and Flurp. Travis is pretty chill but Flurp can't lie down properly. 10/10 &amp; 8/10\nget it together Flurp https://t.co/Akzl5ynMmE",
 'This is Socks. That water pup w the super legs just splashed him. Socks did not appreciate that. 9/10 and 2/10 https://t.co/8rc5I22bBf',
 "This may be the greatest video I've ever been sent. 4/10 for Charles the puppy, 13/10 overall. (Vid by @stevenxx_) https://t.co/uaJmNgXR2P",
 "Meet Oliviér. He takes killer selfies. Has a dog of his own. It leaps at random &amp; can't bark for shit. 10/10 &amp; 5/10 https://t.co/6NgsQJuSBJ",
 "When bae says they can't go out but you see them with someone else that same night. 5/10 &amp; 10/10 for heartbroken pup https://t.co/aenk0KpoWM",
 "This is Eriq. His friend just reminded him of last year's super bowl. Not cool friend\n10/10 for Eriq\n6/10 for friend https://t.co/PlEXTofdpf",
 'Meet Fynn &amp; Taco. Fynn is an all-powerful leaf lord and Taco is in the wrong place at the wrong time. 11/10 &amp; 10/10 https://t.co/MuqHPvtL8c',
 'Meet Tassy &amp; Bee. Tassy is pretty chill, but Bee is convinced the Ruffles are haunted. 10/10 &amp; 11/10 respectively https://t.co/fgORpmTN9C',
 'These two pups just met and have instantly bonded. Spectacular scene. Mesmerizing af. 10/10 and 7/10 for blue dog https://t.co/gwryaJO4tC',
 'Meet Rufio. He is unaware of the pink legless pupper wrapped around him. Might want to get that checked 10/10 &amp; 4/10 https://t.co/KNfLnYPmYh',
 'Two gorgeous dogs here. Little waddling dog is a rebel. Refuses to look at camera. Must be a preteen. 5/10 &amp; 8/10 https://t.co/YPfw7oahbD',
 "Meet Eve. She's a raging alcoholic 8/10 (would b 11/10 but pupper alcoholism is a tragic issue that I can't condone) https://t.co/U36HYQIijg",
 '10/10 for dog. 7/10 for cat. 12/10 for human. Much skill. Would pet all https://t.co/uhx5gfpx5k',
 "Meet Holly. She's trying to teach small human-like pup about blocks but he's not paying attention smh. 11/10 &amp; 8/10 https://t.co/RcksaUrGNu",
 "Meet Hank and Sully. Hank is very proud of the pumpkin they found and Sully doesn't give a shit. 11/10 and 8/10 https://t.co/cwoP1ftbrj",
 'Here we have Pancho and Peaches. Pancho is a Condoleezza Gryffindor, and Peaches is just an asshole. 10/10 &amp; 7/10 https://t.co/Lh1BsJrWPp',
 "This is Spark. He's nervous. Other dog hasn't moved in a while. Won't come when called. Doesn't fetch well 8/10&amp;1/10 https://t.co/stEodX9Aba",
 'This is Kial. Kial is either wearing a cape, which would be rad, or flashing us, which would be rude. 10/10 or 4/10 https://t.co/8zcwIoiuqR',
 'Two dogs in this one. Both are rare Jujitsu Pythagoreans. One slightly whiter than other. Long legs. 7/10 and 8/10 https://t.co/ITxxcc4v9y',
 'These are Peruvian Feldspars. Their names are Cupit and Prencer. Both resemble Rand Paul. Sick outfits 10/10 &amp; 10/10 https://t.co/ZnEMHBsAs1']
In [82]:
df_dog_clean.loc[612].text
Out[82]:
'"Yep... just as I suspected. You\'re not flossing." 12/10 and 11/10 for the pup not flossing https://t.co/SuXcI9B7pQ'
In [83]:
df_dog_clean.loc[964].text
Out[83]:
'Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a'

Now this is tricky: we know that 4/20 is not a score, but sometimes there are two legitimate scores. I can check and manually set the scores for multiple dogs after cleanup for these tweets. I'll take the last numerator and denominator for all tweets, and for these specific ones I'll make sure that they are correct, or add up the scores for multiple dogs (e.g. tweet index 612 will be 12/10 and 11/10 -> 23/20).
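
The summing step for a manually verified multi-dog tweet could look roughly like this. The helper `combined_rating` is hypothetical, and it should only be applied to tweets that have been checked by hand, since it cannot tell a score from a date.

```python
import re

def combined_rating(text):
    """Sum every 'n/d' fraction (d ending in 0) into one (numerator, denominator) pair.

    Hypothetical helper for tweets already confirmed to contain several real scores.
    """
    pairs = re.findall(r"(\d+)/(\d+0)\b", text)
    numerator = sum(int(n) for n, _ in pairs)
    denominator = sum(int(d) for _, d in pairs)
    return numerator, denominator

# Tweet index 612: two legitimate scores
flossing = ('"Yep... just as I suspected. You\'re not flossing." 12/10 and 11/10 '
            'for the pup not flossing https://t.co/SuXcI9B7pQ')
print(combined_rating(flossing))

# Tweet index 964: 4/20 is a date, so blind summing gives a meaningless result
squad = "Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a"
print(combined_rating(squad))
```

The second call shows why the manual check matters: the date is added into the total, so for that tweet only the last fraction should be kept.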

In [84]:
denominator=denominator.reset_index()
denominator.loc[1648:1653]
Out[84]:
level_0 match 0
1648 1728 0 10
1649 1729 0 10
1650 1729 1 10
1651 1729 2 10
1652 1730 0 10
1653 1731 0 10
In [85]:
denominator=denominator[~denominator.level_0.duplicated(keep='last')]
denominator=denominator.set_index('level_0',verify_integrity=True)
denominator.loc[1728:1731]
Out[85]:
match 0
level_0
1728 0 10
1729 2 10
1730 0 10
1731 0 10
In [86]:
denominator[0]=denominator[0].astype('int64')
denominator.drop('match',axis=1,inplace=True)
In [87]:
denominator[0].value_counts()
Out[87]:
10     1959
80        2
50        2
170       1
150       1
120       1
110       1
90        1
70        1
40        1
Name: 0, dtype: int64
In [88]:
df_dog_clean['rating_denominator']=denominator

Test

In [89]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1970 entries, 0 to 2069
Data columns (total 23 columns):
tweet_id              1970 non-null int64
timestamp             1970 non-null object
source                1970 non-null object
text                  1970 non-null object
expanded_urls         1970 non-null object
rating_numerator      1970 non-null int64
rating_denominator    1970 non-null int64
name                  1379 non-null object
dog_stage             1970 non-null object
favorite_count        1970 non-null int64
retweet_count         1970 non-null int64
followers_count       1970 non-null int64
jpg_url               1970 non-null object
img_num               1970 non-null int64
p1                    1970 non-null object
p1_conf               1970 non-null float64
p1_dog                1970 non-null bool
p2                    1970 non-null object
p2_conf               1970 non-null float64
p2_dog                1970 non-null bool
p3                    1970 non-null object
p3_conf               1970 non-null float64
p3_dog                1970 non-null bool
dtypes: bool(3), float64(3), int64(7), object(10)
memory usage: 409.0+ KB
In [90]:
testset=[731, 873, 921, 964, 1402, 2049]
with pd.option_context('display.max_colwidth',-1):
    print(df_dog_clean[['text','rating_denominator']].loc[testset])
                                                                                                                                              text  \
731   Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE                                                                 
873   After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ   
921   Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv                      
964   Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a                                                                               
1402  This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5    
2049  This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv       

      rating_denominator  
731   150                 
873   10                  
921   170                 
964   10                  
1402  10                  
2049  10                  

Clean

Define

  • Some tweets have oddly low rating numerators, which may not have been extracted correctly by the program.

Probe tweets with low rating numerators to see if there is some common reason, or if the rating was not correctly extracted.

Code

In [91]:
df_dog_clean.rating_numerator
Out[91]:
0       13
1       13
2       12
3       13
4       12
5       13
6       13
7       13
8       13
9       14
10      13
11      13
12      13
13      12
14      13
15      13
16      12
17      13
18      13
19      12
20      13
21      14
22      13
23      13
24      12
25      13
26      13
27      13
28      12
29      13
        ..
2040     2
2041     7
2042     9
2043    11
2044     6
2045     8
2046    10
2047     9
2048     3
2049     1
2050    11
2051    10
2052     1
2053    11
2054     8
2055     9
2056     6
2057    10
2058     9
2059    10
2060     8
2061     9
2062    10
2063     2
2064    10
2065     5
2066     6
2067     9
2068     7
2069     8
Name: rating_numerator, Length: 1970, dtype: int64
In [92]:
list(df_dog_clean[df_dog_clean.rating_numerator==144].text)
Out[92]:
["IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq"]
In [93]:
list(df_dog_clean[df_dog_clean.rating_numerator.isin([0,1,204,420,44,88,1776])].text)
Out[93]:
["When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag",
 "This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh",
 'Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv',
 "Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ",
 'Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw',
 "What kind of person sends in a picture without a dog in it? 1/10 just because that's a nice table https://t.co/RDXCfk8hK0",
 'After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY',
 "Flamboyant pup here. Probably poisonous. Won't eat kibble. Doesn't bark. Slow af. Petting doesn't look fun. 1/10 https://t.co/jxukeh2BeO",
 'Never seen dog like this. Breathes heavy. Tilts head in a pattern. No bark. Shitty at fetch. Not even cordless. 1/10 https://t.co/i9iSGNn3fx',
 'This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv',
 "Not familiar with this breed. No tail (weird). Only 2 legs. Doesn't bark. Surprisingly quick. Shits eggs. 1/10 https://t.co/Asgdc6kuLX"]

A sampling of some numbers that stood out. Most of them look legitimate. Some correspond to multiple dogs and make sense. Some are referential numbers (e.g. 1776 for an "America af" dog), some are low because the subjects are not dogs, and some were extracted incorrectly, as happened with the denominators above. A small modification of the denominator extraction should work.

In [94]:
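# capture the numerator of every "N/D" pattern whose denominator ends in 0
numerator=df_dog_clean.text.str.extractall(r'(\d+)/\d+0')
# rows with a second match (match index 1) contain more than one candidate rating
num_list=numerator.xs(1,level='match').index.tolist()
list(df_dog_clean.loc[num_list].text)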
Out[94]:
['"Yep... just as I suspected. You\'re not flossing." 12/10 and 11/10 for the pup not flossing https://t.co/SuXcI9B7pQ',
 'This is Bookstore and Seaweed. Bookstore is tired and Seaweed is an asshole. 10/10 and 7/10 respectively https://t.co/eUGjGjjFVJ',
 'Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a',
 'This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq',
 "Meet Travis and Flurp. Travis is pretty chill but Flurp can't lie down properly. 10/10 &amp; 8/10\nget it together Flurp https://t.co/Akzl5ynMmE",
 'This is Socks. That water pup w the super legs just splashed him. Socks did not appreciate that. 9/10 and 2/10 https://t.co/8rc5I22bBf',
 "This may be the greatest video I've ever been sent. 4/10 for Charles the puppy, 13/10 overall. (Vid by @stevenxx_) https://t.co/uaJmNgXR2P",
 "Meet Oliviér. He takes killer selfies. Has a dog of his own. It leaps at random &amp; can't bark for shit. 10/10 &amp; 5/10 https://t.co/6NgsQJuSBJ",
 "When bae says they can't go out but you see them with someone else that same night. 5/10 &amp; 10/10 for heartbroken pup https://t.co/aenk0KpoWM",
 "This is Eriq. His friend just reminded him of last year's super bowl. Not cool friend\n10/10 for Eriq\n6/10 for friend https://t.co/PlEXTofdpf",
 'Meet Fynn &amp; Taco. Fynn is an all-powerful leaf lord and Taco is in the wrong place at the wrong time. 11/10 &amp; 10/10 https://t.co/MuqHPvtL8c',
 'Meet Tassy &amp; Bee. Tassy is pretty chill, but Bee is convinced the Ruffles are haunted. 10/10 &amp; 11/10 respectively https://t.co/fgORpmTN9C',
 'These two pups just met and have instantly bonded. Spectacular scene. Mesmerizing af. 10/10 and 7/10 for blue dog https://t.co/gwryaJO4tC',
 'Meet Rufio. He is unaware of the pink legless pupper wrapped around him. Might want to get that checked 10/10 &amp; 4/10 https://t.co/KNfLnYPmYh',
 'Two gorgeous dogs here. Little waddling dog is a rebel. Refuses to look at camera. Must be a preteen. 5/10 &amp; 8/10 https://t.co/YPfw7oahbD',
 "Meet Eve. She's a raging alcoholic 8/10 (would b 11/10 but pupper alcoholism is a tragic issue that I can't condone) https://t.co/U36HYQIijg",
 '10/10 for dog. 7/10 for cat. 12/10 for human. Much skill. Would pet all https://t.co/uhx5gfpx5k',
 "Meet Holly. She's trying to teach small human-like pup about blocks but he's not paying attention smh. 11/10 &amp; 8/10 https://t.co/RcksaUrGNu",
 "Meet Hank and Sully. Hank is very proud of the pumpkin they found and Sully doesn't give a shit. 11/10 and 8/10 https://t.co/cwoP1ftbrj",
 'Here we have Pancho and Peaches. Pancho is a Condoleezza Gryffindor, and Peaches is just an asshole. 10/10 &amp; 7/10 https://t.co/Lh1BsJrWPp',
 "This is Spark. He's nervous. Other dog hasn't moved in a while. Won't come when called. Doesn't fetch well 8/10&amp;1/10 https://t.co/stEodX9Aba",
 'This is Kial. Kial is either wearing a cape, which would be rad, or flashing us, which would be rude. 10/10 or 4/10 https://t.co/8zcwIoiuqR',
 'Two dogs in this one. Both are rare Jujitsu Pythagoreans. One slightly whiter than other. Long legs. 7/10 and 8/10 https://t.co/ITxxcc4v9y',
 'These are Peruvian Feldspars. Their names are Cupit and Prencer. Both resemble Rand Paul. Sick outfits 10/10 &amp; 10/10 https://t.co/ZnEMHBsAs1']

A lot of these look familiar. Compare them to the list of tweets that need to be fixed manually.

In [95]:
num_list=numerator.xs(1,level='match').index.tolist()
list(set(num_list)-set(manual_rating_clean))
Out[95]:
[]

The empty result means that every tweet with more than one numerator match is already in manual_rating_clean, the list of tweets that also produced multiple matches during denominator extraction. These need to be fixed manually.
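As a minimal sketch of the check above, using invented tweet texts and a hypothetical `already_flagged` list standing in for `manual_rating_clean`: `extractall` numbers the matches within each row as 0, 1, 2, ..., so selecting match 1 returns the rows that contain at least two rating-like patterns.

```python
import pandas as pd

# Toy texts (invented for illustration); the keys mimic row labels.
texts = pd.Series(
    {
        10: "Good boy. 12/10 https://t.co/xxxx",
        11: "Two pups here. 10/10 and 7/10 respectively",
        12: "Happy 4/20 from the squad! 13/10 for all",
    }
)

# One row per match, indexed by (original label, match number).
matches = texts.str.extractall(r"(\d+)/\d+0")

# Rows that have a second match (match number 1) are the ambiguous ones.
multi = matches.xs(1, level="match").index.tolist()

# Hypothetical stand-in for manual_rating_clean.
already_flagged = [11, 12]

# An empty list means every multi-match row is already flagged.
not_yet_flagged = sorted(set(multi) - set(already_flagged))
print(multi, not_yet_flagged)
```

Note that the set difference only shows that the new list is contained in the flagged list; it does not by itself show that the two lists are identical.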

It is also important to check for decimals. Are all ratings integers, and if not, are they captured correctly?

Here is an example where a decimal numerator was extracted incorrectly: the tweet says 13.5/10, but the stored numerator is 5.

In [96]:
list(df_dog_clean[df_dog_clean.tweet_id==883482846933004288].text)
Out[96]:
['This is Bella. She hopes her smile made you smile. If not, she is also offering you her favorite monkey. 13.5/10 https://t.co/qjrljjt948']
In [97]:
df_dog_clean[df_dog_clean.tweet_id==883482846933004288].rating_numerator
Out[97]:
40    5
Name: rating_numerator, dtype: int64
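The mis-extraction can be reproduced on a standalone string. This is a small sketch using Python's `re` module rather than the pandas call; the text is a shortened version of the tweet above. It compares an integer-only capture group with one that allows an optional fractional part.

```python
import re

text = "This is Bella. She hopes her smile made you smile. 13.5/10"

# Integer-only pattern: the capture can only start after the decimal point.
int_only = re.findall(r"(\d+)/\d+0", text)

# Allowing an optional fractional part captures the full numerator.
with_decimal = re.findall(r"(\d+\.?\d*)/\d+0", text)

print(int_only, with_decimal)
```

The integer-only pattern returns the digits after the decimal point, which is consistent with the stored value of 5.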
In [98]:
# allow an optional decimal part in the numerator, e.g. 13.5/10
numerator=df_dog_clean.text.str.extractall(r'(\d+\.?\d*)/\d+0')
num_list=numerator.xs(1,level='match').index.tolist()
# list(df_dog_clean.loc[num_list].text)
list(set(num_list)-set(manual_rating_clean))
Out[98]:
[]
In [99]:
numerator=numerator.reset_index()
# keep only the last match for each tweet (level_0 is the original row label)
numerator=numerator[~numerator.level_0.duplicated(keep='last')]
numerator=numerator.set_index('level_0',verify_integrity=True)
numerator[0]=numerator[0].astype('float64') #keep decimals
numerator.drop('match',axis=1,inplace=True)
numerator[0].value_counts()
Out[99]:
12.00      446
10.00      410
11.00      393
13.00      255
9.00       149
8.00        98
7.00        53
14.00       34
6.00        33
5.00        31
3.00        19
4.00        16
2.00        10
1.00         5
60.00        1
84.00        1
99.00        1
1776.00      1
13.50        1
11.27        1
165.00       1
11.26        1
0.00         1
9.75         1
45.00        1
88.00        1
144.00       1
44.00        1
121.00       1
204.00       1
80.00        1
420.00       1
Name: 0, dtype: int64
In [100]:
numerator.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1970 entries, 0 to 2069
Data columns (total 1 columns):
0    1970 non-null float64
dtypes: float64(1)
memory usage: 30.8 KB
In [101]:
df_dog_clean['rating_numerator']=numerator
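The reset_index / duplicated(keep='last') steps above keep the last match per tweet. A more compact way that should give the same result is to group the matches by the original row label and take the last one. The sketch below uses invented, shortened texts; `last_numerator` is a name introduced here for illustration only.

```python
import pandas as pd

# Toy texts (shortened or invented); keys mimic row labels.
texts = pd.Series(
    {
        0: "Happy 4/20 from the squad! 13/10 for all",
        1: "She is offering you her favorite monkey. 13.5/10",
        2: "Why does this never happen at my front door... 165/150",
    }
)

matches = texts.str.extractall(r"(\d+\.?\d*)/\d+0")

# For each original row, keep the last match and convert it to float.
last_numerator = matches[0].groupby(level=0).last().astype("float64")
print(last_numerator.tolist())
```

Taking the last match works for tweets like "Happy 4/20 ... 13/10" where the rating comes at the end, but it is still only a heuristic, which is why the multi-rating tweets are handled manually.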

Test

In [102]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1970 entries, 0 to 2069
Data columns (total 23 columns):
tweet_id              1970 non-null int64
timestamp             1970 non-null object
source                1970 non-null object
text                  1970 non-null object
expanded_urls         1970 non-null object
rating_numerator      1970 non-null float64
rating_denominator    1970 non-null int64
name                  1379 non-null object
dog_stage             1970 non-null object
favorite_count        1970 non-null int64
retweet_count         1970 non-null int64
followers_count       1970 non-null int64
jpg_url               1970 non-null object
img_num               1970 non-null int64
p1                    1970 non-null object
p1_conf               1970 non-null float64
p1_dog                1970 non-null bool
p2                    1970 non-null object
p2_conf               1970 non-null float64
p2_dog                1970 non-null bool
p3                    1970 non-null object
p3_conf               1970 non-null float64
p3_dog                1970 non-null bool
dtypes: bool(3), float64(4), int64(6), object(10)
memory usage: 409.0+ KB
In [103]:
with pd.option_context('display.max_colwidth',-1):
    print(df_dog_clean[['text','rating_numerator']].loc[testset]) #same testset as for the denominator
                                                                                                                                              text  \
731   Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE                                                                 
873   After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ   
921   Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv                      
964   Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a                                                                               
1402  This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5    
2049  This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv       

      rating_numerator  
731   165.0             
873   14.0              
921   204.0             
964   13.0              
1402  10.0              
2049  9.0               
In [104]:
df_dog_clean[df_dog_clean.tweet_id==883482846933004288][['text','rating_numerator']]
Out[104]:
text rating_numerator
40 This is Bella. She hopes her smile made you sm... 13.5

Clean

Define

  • Combine scores for multiple dogs within the 24 "problem" tweet ratings to fix the numerators and denominators.

Code

In [105]:
with pd.option_context('display.max_colwidth',-1):
    for tweet in manual_rating_clean:
        print(df_dog_clean[['text','rating_numerator','rating_denominator']].loc[tweet])
#     print(df_dog_clean[['text','rating_numerator','rating_denominator']].loc[manual_rating_clean])
text                  "Yep... just as I suspected. You're not flossing." 12/10 and 11/10 for the pup not flossing https://t.co/SuXcI9B7pQ
rating_numerator      11                                                                                                                 
rating_denominator    10                                                                                                                 
Name: 612, dtype: object
text                  This is Bookstore and Seaweed. Bookstore is tired and Seaweed is an asshole. 10/10 and 7/10 respectively https://t.co/eUGjGjjFVJ
rating_numerator      7                                                                                                                               
rating_denominator    10                                                                                                                              
Name: 822, dtype: object
text                  Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a
rating_numerator      13                                                              
rating_denominator    10                                                              
Name: 964, dtype: object
text                  This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
rating_numerator      11                                                                                                                   
rating_denominator    10                                                                                                                   
Name: 998, dtype: object
text                  Meet Travis and Flurp. Travis is pretty chill but Flurp can't lie down properly. 10/10 &amp; 8/10\nget it together Flurp https://t.co/Akzl5ynMmE
rating_numerator      8                                                                                                                                               
rating_denominator    10                                                                                                                                              
Name: 1014, dtype: object
text                  This is Socks. That water pup w the super legs just splashed him. Socks did not appreciate that. 9/10 and 2/10 https://t.co/8rc5I22bBf
rating_numerator      2                                                                                                                                     
rating_denominator    10                                                                                                                                    
Name: 1136, dtype: object
text                  This may be the greatest video I've ever been sent. 4/10 for Charles the puppy, 13/10 overall. (Vid by @stevenxx_) https://t.co/uaJmNgXR2P
rating_numerator      13                                                                                                                                        
rating_denominator    10                                                                                                                                        
Name: 1226, dtype: object
text                  Meet Oliviér. He takes killer selfies. Has a dog of his own. It leaps at random &amp; can't bark for shit. 10/10 &amp; 5/10 https://t.co/6NgsQJuSBJ
rating_numerator      5                                                                                                                                                  
rating_denominator    10                                                                                                                                                 
Name: 1231, dtype: object
text                  When bae says they can't go out but you see them with someone else that same night. 5/10 &amp; 10/10 for heartbroken pup https://t.co/aenk0KpoWM
rating_numerator      10                                                                                                                                              
rating_denominator    10                                                                                                                                              
Name: 1266, dtype: object
text                  This is Eriq. His friend just reminded him of last year's super bowl. Not cool friend\n10/10 for Eriq\n6/10 for friend https://t.co/PlEXTofdpf
rating_numerator      6                                                                                                                                             
rating_denominator    10                                                                                                                                            
Name: 1281, dtype: object
text                  Meet Fynn &amp; Taco. Fynn is an all-powerful leaf lord and Taco is in the wrong place at the wrong time. 11/10 &amp; 10/10 https://t.co/MuqHPvtL8c
rating_numerator      10                                                                                                                                                 
rating_denominator    10                                                                                                                                                 
Name: 1292, dtype: object
text                  Meet Tassy &amp; Bee. Tassy is pretty chill, but Bee is convinced the Ruffles are haunted. 10/10 &amp; 11/10 respectively https://t.co/fgORpmTN9C
rating_numerator      11                                                                                                                                               
rating_denominator    10                                                                                                                                               
Name: 1524, dtype: object
text                  These two pups just met and have instantly bonded. Spectacular scene. Mesmerizing af. 10/10 and 7/10 for blue dog https://t.co/gwryaJO4tC
rating_numerator      7                                                                                                                                        
rating_denominator    10                                                                                                                                       
Name: 1558, dtype: object
text                  Meet Rufio. He is unaware of the pink legless pupper wrapped around him. Might want to get that checked 10/10 &amp; 4/10 https://t.co/KNfLnYPmYh
rating_numerator      4                                                                                                                                               
rating_denominator    10                                                                                                                                              
Name: 1620, dtype: object
text                  Two gorgeous dogs here. Little waddling dog is a rebel. Refuses to look at camera. Must be a preteen. 5/10 &amp; 8/10 https://t.co/YPfw7oahbD
rating_numerator      8                                                                                                                                            
rating_denominator    10                                                                                                                                           
Name: 1624, dtype: object
text                  Meet Eve. She's a raging alcoholic 8/10 (would b 11/10 but pupper alcoholism is a tragic issue that I can't condone) https://t.co/U36HYQIijg
rating_numerator      11                                                                                                                                          
rating_denominator    10                                                                                                                                          
Name: 1689, dtype: object
text                  10/10 for dog. 7/10 for cat. 12/10 for human. Much skill. Would pet all https://t.co/uhx5gfpx5k
rating_numerator      12                                                                                             
rating_denominator    10                                                                                             
Name: 1729, dtype: object
text                  Meet Holly. She's trying to teach small human-like pup about blocks but he's not paying attention smh. 11/10 &amp; 8/10 https://t.co/RcksaUrGNu
rating_numerator      8                                                                                                                                              
rating_denominator    10                                                                                                                                             
Name: 1782, dtype: object
text                  Meet Hank and Sully. Hank is very proud of the pumpkin they found and Sully doesn't give a shit. 11/10 and 8/10 https://t.co/cwoP1ftbrj
rating_numerator      8                                                                                                                                      
rating_denominator    10                                                                                                                                     
Name: 1831, dtype: object
text                  Here we have Pancho and Peaches. Pancho is a Condoleezza Gryffindor, and Peaches is just an asshole. 10/10 &amp; 7/10 https://t.co/Lh1BsJrWPp
rating_numerator      7                                                                                                                                            
rating_denominator    10                                                                                                                                           
Name: 1894, dtype: object
text                  This is Spark. He's nervous. Other dog hasn't moved in a while. Won't come when called. Doesn't fetch well 8/10&amp;1/10 https://t.co/stEodX9Aba
rating_numerator      1                                                                                                                                               
rating_denominator    10                                                                                                                                              
Name: 1931, dtype: object
text                  This is Kial. Kial is either wearing a cape, which would be rad, or flashing us, which would be rude. 10/10 or 4/10 https://t.co/8zcwIoiuqR
rating_numerator      4                                                                                                                                          
rating_denominator    10                                                                                                                                         
Name: 1978, dtype: object
text                  Two dogs in this one. Both are rare Jujitsu Pythagoreans. One slightly whiter than other. Long legs. 7/10 and 8/10 https://t.co/ITxxcc4v9y
rating_numerator      8                                                                                                                                         
rating_denominator    10                                                                                                                                        
Name: 1987, dtype: object
text                  These are Peruvian Feldspars. Their names are Cupit and Prencer. Both resemble Rand Paul. Sick outfits 10/10 &amp; 10/10 https://t.co/ZnEMHBsAs1
rating_numerator      10                                                                                                                                              
rating_denominator    10                                                                                                                                              
Name: 2020, dtype: object
In [106]:
numerator=[23,17,13,11,18,11,13,15,15,10,21,21,17,14,13,8,10,19,19,17,9,10,15,20]
denominator=[20,20,10,10,20,20,10,20,20,10,20,20,20,20,20,10,10,20,20,20,20,10,20,20]
In [107]:
manualfix=pd.DataFrame(index=manual_rating_clean,data={'numerator':numerator,'denominator':denominator})
manualfix.numerator
Out[107]:
612     23
822     17
964     13
998     11
1014    18
1136    11
1226    13
1231    15
1266    15
1281    10
1292    21
1524    21
1558    17
1620    14
1624    13
1689     8
1729    10
1782    19
1831    19
1894    17
1931     9
1978    10
1987    15
2020    20
Name: numerator, dtype: int64
In [108]:
df_dog_clean.rating_denominator.update(manualfix.denominator)
df_dog_clean.rating_numerator.update(manualfix.numerator)
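Calling update on a column accessed as an attribute worked in the pandas version used here, but it modifies a Series obtained from the frame, which may not propagate back to the frame in newer pandas versions with copy-on-write enabled. A label-aligned `.loc` assignment on the frame itself avoids that question. The sketch below uses a toy frame with invented values; `df` and `fix` stand in for `df_dog_clean` and `manualfix`.

```python
import pandas as pd

# Toy frame standing in for df_dog_clean (values invented for illustration).
df = pd.DataFrame(
    {"rating_numerator": [11.0, 13.0, 7.0], "rating_denominator": [10, 10, 10]},
    index=[612, 700, 822],
)

# Hand-entered corrections keyed by the same row labels.
fix = pd.DataFrame({"numerator": [23, 17], "denominator": [20, 20]}, index=[612, 822])

# Assign on the frame itself; pandas aligns the values on the row labels.
df.loc[fix.index, "rating_numerator"] = fix["numerator"]
df.loc[fix.index, "rating_denominator"] = fix["denominator"]
print(df)
```

Rows that are not in `fix` (label 700 here) are left unchanged.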

Test

In [109]:
df_dog_clean[['rating_numerator','rating_denominator']].loc[manual_rating_clean]
Out[109]:
rating_numerator rating_denominator
612 23.0 20
822 17.0 20
964 13.0 10
998 11.0 10
1014 18.0 20
1136 11.0 20
1226 13.0 10
1231 15.0 20
1266 15.0 20
1281 10.0 10
1292 21.0 20
1524 21.0 20
1558 17.0 20
1620 14.0 20
1624 13.0 20
1689 8.0 10
1729 10.0 10
1782 19.0 20
1831 19.0 20
1894 17.0 20
1931 9.0 20
1978 10.0 10
1987 15.0 20
2020 20.0 20
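For the straightforward two-dog tweets, the hand-entered values are simply the sums of the individual ratings. A rough sketch of that arithmetic follows; `combine_ratings` is a hypothetical helper, not part of the notebook, and it only counts ratings with a denominator of exactly 10.

```python
import re


def combine_ratings(text):
    """Sum every 'N/10' rating in a tweet into one (numerator, denominator) pair."""
    # Only denominators of exactly 10 are treated as per-dog ratings here.
    pairs = re.findall(r"(\d+)/(10)\b", text)
    numerator = sum(int(n) for n, _ in pairs)
    denominator = sum(int(d) for _, d in pairs)
    return numerator, denominator


print(combine_ratings("Meet Hank and Sully. 11/10 and 8/10 https://t.co/cwoP1ftbrj"))
print(combine_ratings("Doesn't fetch well 8/10&amp;1/10 https://t.co/stEodX9Aba"))
```

This does not replace the manual pass: tweets such as "4/10 for Charles the puppy, 13/10 overall" or "8/10 (would b 11/10 ...)" need a judgment call about which rating counts, and a blind sum would get them wrong.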

Clean

Define

  • rating_numerator and rating_denominator can be combined into a single rating value.

Create a variable rating by dividing rating_numerator by rating_denominator, and a variable num_dogs by dividing rating_denominator by 10.

Code

In [110]:
df_dog_clean['rating']=df_dog_clean.rating_numerator/df_dog_clean.rating_denominator
df_dog_clean['num_dogs']=df_dog_clean.rating_denominator/10
df_dog_clean[['rating','num_dogs']].sample(10,random_state=15)
Out[110]:
rating num_dogs
1363 0.5 1.0
1815 0.8 1.0
269 1.2 1.0
239 1.2 1.0
991 1.0 1.0
2036 0.6 1.0
980 1.1 1.0
1087 1.0 1.0
1113 1.1 1.0
244 1.2 1.0

Test

In [111]:
df_dog_clean[['rating_numerator','rating_denominator','rating']].sample(10,random_state=150)
Out[111]:
rating_numerator rating_denominator rating
431 12.0 10 1.2
2022 9.0 10 0.9
617 12.0 10 1.2
1277 8.0 10 0.8
156 13.0 10 1.3
805 11.0 10 1.1
64 13.0 10 1.3
1986 9.0 10 0.9
909 11.0 10 1.1
503 12.0 10 1.2
In [112]:
df_dog_clean[['rating_numerator','rating_denominator','rating']].loc[manual_rating_clean]
Out[112]:
rating_numerator rating_denominator rating
612 23.0 20 1.15
822 17.0 20 0.85
964 13.0 10 1.30
998 11.0 10 1.10
1014 18.0 20 0.90
1136 11.0 20 0.55
1226 13.0 10 1.30
1231 15.0 20 0.75
1266 15.0 20 0.75
1281 10.0 10 1.00
1292 21.0 20 1.05
1524 21.0 20 1.05
1558 17.0 20 0.85
1620 14.0 20 0.70
1624 13.0 20 0.65
1689 8.0 10 0.80
1729 10.0 10 1.00
1782 19.0 20 0.95
1831 19.0 20 0.95
1894 17.0 20 0.85
1931 9.0 20 0.45
1978 10.0 10 1.00
1987 15.0 20 0.75
2020 20.0 20 1.00
In [113]:
df_dog_clean.drop(['rating_numerator','rating_denominator'],axis=1,inplace=True)
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1970 entries, 0 to 2069
Data columns (total 23 columns):
tweet_id           1970 non-null int64
timestamp          1970 non-null object
source             1970 non-null object
text               1970 non-null object
expanded_urls      1970 non-null object
name               1379 non-null object
dog_stage          1970 non-null object
favorite_count     1970 non-null int64
retweet_count      1970 non-null int64
followers_count    1970 non-null int64
jpg_url            1970 non-null object
img_num            1970 non-null int64
p1                 1970 non-null object
p1_conf            1970 non-null float64
p1_dog             1970 non-null bool
p2                 1970 non-null object
p2_conf            1970 non-null float64
p2_dog             1970 non-null bool
p3                 1970 non-null object
p3_conf            1970 non-null float64
p3_dog             1970 non-null bool
rating             1970 non-null float64
num_dogs           1970 non-null float64
dtypes: bool(3), float64(5), int64(5), object(10)
memory usage: 409.0+ KB

Clean

Define

  • Some images are not classified as dogs (either because they do not show dogs or because they were misclassified)

Inspect a sample of images that are not classified as dogs, and determine if they should be kept or not.

Code

In [114]:
# boolean masks: True where the n-th prediction is not a dog breed
p1=~df_dog_clean.p1_dog
p2=~df_dog_clean.p2_dog
p3=~df_dog_clean.p3_dog

sum(p1&p2&p3)
Out[114]:
305
In [115]:
df_dog_clean[p1&p2&p3][['text','jpg_url','p1','p2','p3']].head(8)
Out[115]:
text jpg_url p1 p2 p3
0 This is Phineas. He's a mystical boy. Only eve... https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg orange bagel banana
21 I've yet to rate a Venezuelan Hover Wiener. Th... https://pbs.twimg.com/ext_tw_video_thumb/88751... limousine tow_truck shopping_cart
27 This is Derek. He's late for a dog meeting. 13... https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg convertible sports_car car_wheel
51 Meet Elliot. He's a Canadian Forrest Pup. Unus... https://pbs.twimg.com/media/DDrk-f9WAAI-WQv.jpg tusker Indian_elephant ibex
52 This is Louis. He's crossing. It's a big deal.... https://pbs.twimg.com/media/DDm2Z5aXUAEDS2u.jpg street_sign umbrella traffic_light
61 This is Steven. He has trouble relating to oth... https://pbs.twimg.com/media/DDMD_phXoAQ1qf0.jpg tabby window_screen Egyptian_cat
93 Real funny guys. Sending in a pic without a do... https://pbs.twimg.com/media/DBW35ZsVoAEWZUU.jpg home_theater sandbar television
97 Meet Clifford. He's quite large. Also red. Goo... https://pbs.twimg.com/media/DBMV3NnXUAAm0Pp.jpg comic_book envelope book_jacket
In [116]:
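from io import BytesIO
from PIL import Image
from IPython.display import display

# download the first 8 images for which none of the three predictions is a dog
i=[]
for x in range(8):
    url=df_dog_clean.jpg_url[p1&p2&p3].iloc[x]
    r=requests.get(url)
    i.append(Image.open(BytesIO(r.content)))

for img in i:
    display(img)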

Test

In [117]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1970 entries, 0 to 2069
Data columns (total 23 columns):
tweet_id           1970 non-null int64
timestamp          1970 non-null object
source             1970 non-null object
text               1970 non-null object
expanded_urls      1970 non-null object
name               1379 non-null object
dog_stage          1970 non-null object
favorite_count     1970 non-null int64
retweet_count      1970 non-null int64
followers_count    1970 non-null int64
jpg_url            1970 non-null object
img_num            1970 non-null int64
p1                 1970 non-null object
p1_conf            1970 non-null float64
p1_dog             1970 non-null bool
p2                 1970 non-null object
p2_conf            1970 non-null float64
p2_dog             1970 non-null bool
p3                 1970 non-null object
p3_conf            1970 non-null float64
p3_dog             1970 non-null bool
rating             1970 non-null float64
num_dogs           1970 non-null float64
dtypes: bool(3), float64(5), int64(5), object(10)
memory usage: 409.0+ KB

These samples seem to show a big challenge with NN image classification: the neural network cannot choose which part of the image to "attend" to, so when the dog is not the largest object in the image, or blends into the background, it has trouble "focusing" on the dog to make a classification. In addition, some of the images are, humorously, not dogs at all.

I don't think this has to be cleaned, but if we want to do an analysis with dog breeds, only images with a high-confidence dog prediction should be used.
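One way such a restriction could look is sketched below on a toy frame with invented values. The column names match the image-prediction data, but the 0.7 cutoff (`MIN_CONF`) is an assumption chosen for illustration and is not taken from the notebook.

```python
import pandas as pd

# Toy predictions (invented values) using the image-prediction column names.
preds = pd.DataFrame(
    {
        "p1": ["golden_retriever", "orange", "pug", "Chihuahua"],
        "p1_conf": [0.95, 0.80, 0.40, 0.75],
        "p1_dog": [True, False, True, True],
    }
)

MIN_CONF = 0.7  # assumed cutoff, not taken from the notebook

# Keep rows whose top prediction is a dog breed and clears the cutoff.
confident_dogs = preds[preds["p1_dog"] & (preds["p1_conf"] >= MIN_CONF)]
print(confident_dogs["p1"].tolist())
```

Filtering at analysis time like this leaves the cleaned table intact, so the non-dog rows remain available for other questions.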

Clean

Define

  • Data types: timestamp is stored as object rather than datetime, and tweet_id should be a string.

Convert timestamp to datetime and tweet_id to string.

Code

In [118]:
df_dog_clean.timestamp=pd.to_datetime(df_dog_clean.timestamp)
In [119]:
df_dog_clean.tweet_id=df_dog_clean.tweet_id.astype('str')
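One plausible reason for storing tweet_id as a string (an assumption; the notebook does not state its motivation) is that the IDs are labels rather than quantities, and an 18-digit integer loses precision if the column is ever coerced to float, for example when a merge introduces missing values. A small sketch using the first tweet_id from the archive:

```python
tweet_id = 892420643555336193  # first tweet_id shown in the archive

# A float64 carries 53 bits of precision, so this ID cannot round-trip through it.
round_tripped = int(float(tweet_id))
print(round_tripped == tweet_id)

# As a string, the ID is preserved exactly.
print(str(tweet_id))
```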

Test

In [120]:
df_dog_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1970 entries, 0 to 2069
Data columns (total 23 columns):
tweet_id           1970 non-null object
timestamp          1970 non-null datetime64[ns]
source             1970 non-null object
text               1970 non-null object
expanded_urls      1970 non-null object
name               1379 non-null object
dog_stage          1970 non-null object
favorite_count     1970 non-null int64
retweet_count      1970 non-null int64
followers_count    1970 non-null int64
jpg_url            1970 non-null object
img_num            1970 non-null int64
p1                 1970 non-null object
p1_conf            1970 non-null float64
p1_dog             1970 non-null bool
p2                 1970 non-null object
p2_conf            1970 non-null float64
p2_dog             1970 non-null bool
p3                 1970 non-null object
p3_conf            1970 non-null float64
p3_dog             1970 non-null bool
rating             1970 non-null float64
num_dogs           1970 non-null float64
dtypes: bool(3), datetime64[ns](1), float64(5), int64(4), object(10)
memory usage: 409.0+ KB
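Reading the info() listing by eye works, but the same check can also be expressed as assertions that fail loudly if a conversion is skipped. This is a minimal sketch on an invented two-row frame, not the notebook's actual test; only the column names and the two conversions are taken from the cells above.

```python
import pandas as pd

# Toy stand-in with the two columns that were converted above.
toy = pd.DataFrame({
    'tweet_id': [678021115718029313, 675710890956750848],
    'timestamp': ['2015-12-19 01:16:45', '2015-12-12 16:16:45'],
})
toy['timestamp'] = pd.to_datetime(toy['timestamp'])
toy['tweet_id'] = toy['tweet_id'].astype('str')

# timestamp should now be a datetime column ...
assert pd.api.types.is_datetime64_any_dtype(toy['timestamp'])
# ... and every tweet_id should be a Python string.
assert toy['tweet_id'].map(type).eq(str).all()
```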
In [121]:
df_dog_clean.sample(10,random_state=190)
Out[121]:
tweet_id timestamp source text expanded_urls name dog_stage favorite_count retweet_count followers_count ... p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog rating num_dogs
1506 678021115718029313 2015-12-19 01:16:45 <a href="http://twitter.com/download/iphone" r... This is Reese. He likes holding hands. 12/10 h... https://twitter.com/dog_rates/status/678021115... Reese None 14785 7035 5628197 ... 0.822048 True Doberman 0.096085 True Rottweiler 0.032709 True 1.2 1.0
1575 675710890956750848 2015-12-12 16:16:45 <a href="http://twitter.com/download/iphone" r... This is Lenny. He was just told that he couldn... https://twitter.com/dog_rates/status/675710890... Lenny None 2016 910 5628417 ... 0.441427 True miniature_schnauzer 0.248885 True Sealyham_terrier 0.164967 True 1.2 1.0
748 756303284449767430 2016-07-22 01:42:09 <a href="http://twitter.com/download/iphone" r... Pwease accept dis rose on behalf of dog. 11/10... https://twitter.com/dog_rates/status/756303284... NaN None 4312 1208 5628161 ... 0.981652 True cocker_spaniel 0.006790 True Labrador_retriever 0.004325 True 1.1 1.0
534 790337589677002753 2016-10-23 23:42:19 <a href="http://twitter.com/download/iphone" r... Meet Maggie. She can hear your cells divide. 1... https://twitter.com/dog_rates/status/790337589... Maggie None 8646 2134 5627926 ... 0.658808 True Cardigan 0.153096 True toy_terrier 0.102299 True 1.2 1.0
671 768473857036525572 2016-08-24 15:43:39 <a href="http://twitter.com/download/iphone" r... Meet Chevy. He had a late breakfast and now ha... https://twitter.com/dog_rates/status/768473857... Chevy None 14899 3875 5627931 ... 0.739170 True Chesapeake_Bay_retriever 0.246488 True kelpie 0.006892 True 1.1 1.0
1354 685547936038666240 2016-01-08 19:45:39 <a href="http://twitter.com/download/iphone" r... Everybody needs to read this. Jack is our firs... https://twitter.com/dog_rates/status/685547936... NaN pupper 35688 17459 5628193 ... 0.923987 False oscilloscope 0.009712 False hand-held_computer 0.008769 False 1.4 1.0
1284 690248561355657216 2016-01-21 19:04:15 <a href="http://twitter.com/download/iphone" r... This is Maxwell. That's his moped. He rents it... https://twitter.com/dog_rates/status/690248561... Maxwell None 1816 461 5628185 ... 0.382690 False moped 0.318017 False pickup 0.040625 False 1.1 1.0
632 773985732834758656 2016-09-08 20:45:53 <a href="http://twitter.com/download/iphone" r... Meet Winnie. She just made awkward eye contact... https://twitter.com/dog_rates/status/773985732... Winnie pupper 11773 4397 5627931 ... 0.451149 False fur_coat 0.148001 False pug 0.109570 True 1.1 1.0
2022 666817836334096384 2015-11-18 03:18:55 <a href="http://twitter.com/download/iphone" r... This is Jeph. He is a German Boston Shuttlecoc... https://twitter.com/dog_rates/status/666817836... Jeph None 534 260 5628442 ... 0.496953 True standard_schnauzer 0.285276 True giant_schnauzer 0.073764 True 0.9 1.0
1923 668627278264475648 2015-11-23 03:09:00 <a href="http://twitter.com/download/iphone" r... This is Timofy. He's a pilot for Southwest. It... https://twitter.com/dog_rates/status/668627278... Timofy None 335 122 5628432 ... 0.965403 True pug 0.008604 True Boston_bull 0.008004 True 0.9 1.0

10 rows × 23 columns

Export

In [122]:
df_dog_clean.to_csv('twitter_archive_master.csv')
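One thing to keep in mind when this file is loaded again: CSV does not store dtypes, so a plain read_csv will infer tweet_id as an integer and leave timestamp as text. The sketch below shows one way to restore both on reload. It uses an in-memory buffer and an invented one-row frame instead of the real file, and it passes index=False, which is a choice made for this example (the export cell above keeps the default index).

```python
import io
import pandas as pd

# Toy one-row frame with the cleaned dtypes.
toy = pd.DataFrame({
    'tweet_id': ['678021115718029313'],
    'timestamp': pd.to_datetime(['2015-12-19 01:16:45']),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain reload re-infers the types: tweet_id becomes an integer again.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Declaring the types on read restores the cleaned schema.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'tweet_id': str}, parse_dates=['timestamp'])

print(naive.dtypes)
print(typed.dtypes)
```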

Insights

Here are some insights into the data for act_report.html:

  • Differences in retweets by classified dog breed
  • Differences in retweets by dog stage (e.g. puppers)
  • Rating vs. retweets
  • Whether the dog is named vs. retweets

Dog breed retweets and favorites

In [123]:
df_dog_clean.p1.value_counts().head(15)
Out[123]:
golden_retriever            136
Labrador_retriever           94
Pembroke                     88
Chihuahua                    78
pug                          54
chow                         41
Samoyed                      40
Pomeranian                   38
toy_poodle                   37
malamute                     29
cocker_spaniel               27
French_bulldog               26
Chesapeake_Bay_retriever     23
miniature_pinscher           21
seat_belt                    21
Name: p1, dtype: int64

Set some filters for classification confidence and for the number of dogs in the image.

In [124]:
confidence=df_dog_clean.p1_conf>.6
singledog=df_dog_clean.num_dogs==1
In [125]:
df_dog_clean[(df_dog_clean.p1=='Pomeranian')
             &confidence&singledog][['favorite_count','retweet_count']].mean()
Out[125]:
favorite_count    6080.44
retweet_count     2342.80
dtype: float64
In [126]:
df_dog_clean[(df_dog_clean.p1=='pug')
             &confidence&singledog][['favorite_count','retweet_count']].mean()
Out[126]:
favorite_count    4983.578947
retweet_count     1612.421053
dtype: float64
In [127]:
df_dog_clean[(df_dog_clean.p1=='Labrador_retriever')
             &confidence&singledog][['favorite_count','retweet_count']].mean()
Out[127]:
favorite_count    15267.648148
retweet_count      5167.407407
dtype: float64
In [128]:
df_dog_clean[(df_dog_clean.p1=='golden_retriever')
             &confidence&singledog][['favorite_count','retweet_count']].mean()
Out[128]:
favorite_count    12949.528846
retweet_count      3735.682692
dtype: float64
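The four per-breed cells above repeat the same filter with a different breed name each time. As a sketch of an equivalent one-step version, the same masks can be combined with isin and a groupby. The toy frame below is invented for illustration; only the column names, the .6 cutoff, and the breed list come from this notebook.

```python
import pandas as pd

# Toy stand-in for df_dog_clean; values are invented.
toy = pd.DataFrame({
    'p1': ['pug', 'pug', 'Pomeranian', 'golden_retriever', 'pug'],
    'p1_conf': [0.9, 0.7, 0.8, 0.95, 0.3],
    'num_dogs': [1.0, 1.0, 1.0, 2.0, 1.0],
    'favorite_count': [100, 300, 500, 900, 50],
    'retweet_count': [10, 30, 50, 90, 5],
})

# Same masks as in the notebook, built on the toy frame.
confidence = toy.p1_conf > .6
singledog = toy.num_dogs == 1

breeds = ['golden_retriever', 'Labrador_retriever', 'Pomeranian', 'pug']
subset = toy[toy.p1.isin(breeds) & confidence & singledog]

# One mean per breed, computed in a single pass.
breed_means = subset.groupby('p1')[['favorite_count', 'retweet_count']].mean()
print(breed_means)
```

In the toy data the golden retriever row drops out because it has two dogs, and the low-confidence pug row is excluded, so only Pomeranian and pug appear in the result.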
In [129]:
breed_contrast=df_dog_clean[(df_dog_clean.p1.isin(['golden_retriever',
                                                   'Labrador_retriever',
                                                   'Pomeranian','pug']))&confidence&singledog]
breed_contrast.head()
Out[129]:
tweet_id timestamp source text expanded_urls name dog_stage favorite_count retweet_count followers_count ... p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog rating num_dogs
14 889531135344209921 2017-07-24 17:02:04 <a href="http://twitter.com/download/iphone" r... This is Stuart. He's sporting his favorite fan... https://twitter.com/dog_rates/status/889531135... Stuart puppo 15225 2272 5627901 ... 0.953442 True Labrador_retriever 0.013834 True redbone 0.007958 True 1.30 1.0
16 888917238123831296 2017-07-23 00:22:39 <a href="http://twitter.com/download/iphone" r... This is Jim. He found a fren. Taught him how t... https://twitter.com/dog_rates/status/888917238... Jim None 29323 4591 5627901 ... 0.714719 True Tibetan_mastiff 0.120184 True Labrador_retriever 0.105506 True 1.20 1.0
29 886258384151887873 2017-07-15 16:17:19 <a href="http://twitter.com/download/iphone" r... This is Waffles. His doggles are pupside down.... https://twitter.com/dog_rates/status/886258384... Waffles None 28212 6409 5627901 ... 0.943575 True shower_cap 0.025286 False Siamese_cat 0.002849 False 1.30 1.0
40 883482846933004288 2017-07-08 00:28:19 <a href="http://twitter.com/download/iphone" r... This is Bella. She hopes her smile made you sm... https://twitter.com/dog_rates/status/883482846... Bella None 46365 10189 5627902 ... 0.943082 True Labrador_retriever 0.032409 True kuvasz 0.005501 True 1.35 1.0
42 883117836046086144 2017-07-07 00:17:54 <a href="http://twitter.com/download/iphone" r... Please only send dogs. We don't rate mechanics... https://twitter.com/dog_rates/status/883117836... NaN None 37542 6804 5627902 ... 0.949562 True Labrador_retriever 0.045948 True kuvasz 0.002471 True 1.30 1.0

5 rows × 23 columns

In [130]:
breed_contrast=breed_contrast.replace({'p1':{'golden_retriever': 'Golden Retriever','Labrador_retriever':'Labrador','pug':'Pug'}})
In [131]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [132]:
# Drop the top 1% of retweet counts so a few viral tweets do not compress the boxes.
trimmed=breed_contrast[breed_contrast.retweet_count<breed_contrast.retweet_count.quantile(.99)]
g=sns.boxplot(data=trimmed,y='retweet_count',x='p1',
              order=['Labrador','Golden Retriever','Pomeranian','Pug'])
# g.set_yscale('log')
plt.ylabel('Number of retweets',fontsize=15)
plt.xlabel('Breed',fontsize=15)
plt.title('Retweets vs. Dog Breed',fontsize=20)
plt.tight_layout()
plt.savefig('retweetvsdog.png')
plt.show()
In [133]:
df_dog_clean.groupby('dog_stage')[['favorite_count','retweet_count']].median()
Out[133]:
           favorite_count  retweet_count
dog_stage
None               3871.0         1293.0
doggo             12224.0         3256.0
floofer           11145.0         3192.0
multiple           9898.0         2774.0
pupper             3228.0         1208.0
puppo             13254.0         3070.5
In [134]:
df_dog_clean.groupby('dog_stage').tweet_id.count()
Out[134]:
dog_stage
None        1667
doggo         63
floofer        7
multiple      10
pupper       201
puppo         22
Name: tweet_id, dtype: int64
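The medians and the group sizes are easier to judge side by side, since groups such as floofer (7 tweets) and multiple (10 tweets) are too small for their medians to be very stable. A possible sketch using pandas named aggregation is shown below. The toy frame is invented, and the output column names (n_tweets, median_favorites, median_retweets) are hypothetical labels chosen for this example.

```python
import pandas as pd

# Toy stand-in for df_dog_clean; values are invented.
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4', '5'],
    'dog_stage': ['None', 'None', 'pupper', 'pupper', 'puppo'],
    'favorite_count': [100, 300, 50, 150, 1000],
    'retweet_count': [10, 30, 5, 15, 100],
})

# Count and medians per stage in one table (named aggregation).
stage_summary = toy.groupby('dog_stage').agg(
    n_tweets=('tweet_id', 'count'),
    median_favorites=('favorite_count', 'median'),
    median_retweets=('retweet_count', 'median'),
)
print(stage_summary)
```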
In [135]:
df_dog_clean.groupby(df_dog_clean.name.notnull())[['favorite_count','retweet_count']].median()
Out[135]:
       favorite_count  retweet_count
name
False            3187           1114
True             4651           1490
In [136]:
g=sns.lmplot(data=df_dog_clean,x='favorite_count',y='retweet_count')
g.set(xticks=range(0,150001,50000))
# Dashed identity line (retweets == favorites) for reference.
plt.plot([-10000,140000],[-10000,140000],'k--')
plt.ylabel('Number of Retweets',fontsize=15)
plt.xlabel('Number of Favorites',fontsize=15)
plt.title("Retweets vs. Favorites",fontsize=20)
plt.tight_layout()
plt.savefig('retweetvsfavorites.png')
plt.show()
In [137]:
import numpy as np
np.corrcoef(df_dog_clean.retweet_count,df_dog_clean.favorite_count)[0,1]
Out[137]:
0.9170878090887628
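np.corrcoef returns the Pearson coefficient, which can be pulled around by a handful of viral tweets. A rank-based (Spearman) coefficient is one possible complement, since it only asks whether the two counts rise together. The sketch below computes it as Pearson correlation on ranks, using an invented toy frame with one extreme row. It is an illustration of the technique, not a result from this dataset.

```python
import pandas as pd

# Toy data with one extreme "viral" row; values are invented.
toy = pd.DataFrame({
    'favorite_count': [1, 2, 3, 4, 100],
    'retweet_count': [1, 4, 9, 16, 10000],
})

# Pearson on the raw counts (what np.corrcoef reports).
pearson = toy['retweet_count'].corr(toy['favorite_count'])

# Spearman computed explicitly as Pearson on the ranks.
ranks = toy.rank()
spearman = ranks['retweet_count'].corr(ranks['favorite_count'])

print(pearson, spearman)
```

Because the toy relationship is strictly increasing, the rank correlation is 1 while the Pearson value falls slightly below it.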
In [138]:
df_dog_clean[df_dog_clean.rating<df_dog_clean.rating.quantile(.99)].rating.median()
Out[138]:
1.1
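The same 99th-percentile cutoff appears in several cells (the breed boxplot, the rating median, and the rating histogram). As a sketch, it could be factored into a small helper. The function name trim_upper is hypothetical and the toy ratings are invented; the strict less-than comparison mirrors the cells in this notebook.

```python
import pandas as pd

def trim_upper(df, column, q=0.99):
    """Return the rows whose `column` is strictly below its q-th quantile."""
    return df[df[column] < df[column].quantile(q)]

# Toy ratings with one extreme value; invented for illustration.
toy = pd.DataFrame({'rating': [1.0, 1.1, 1.2, 1.3, 177.6]})

trimmed = trim_upper(toy, 'rating')
print(len(trimmed), trimmed['rating'].median())
```

Because the comparison is strict, the largest value is always dropped, even in small frames.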
In [139]:
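# Ratings are stored as numerator/denominator; multiply by 10 to show them on the "/10" scale.
# Note: distplot is deprecated in newer seaborn releases in favor of histplot.
g=sns.distplot(df_dog_clean[df_dog_clean.rating<df_dog_clean.rating.quantile(.99)].rating*10,
              kde=False)
# Red star placed near the median rating (1.1, i.e. about 11/10).
plt.plot(11.1,200,'r*',ms=15)
plt.ylabel('Count of ratings',fontsize=15)
plt.xlabel('Dog Rating',fontsize=15)
plt.title("Distribution of Ratings",fontsize=20)

plt.tight_layout()
plt.savefig('distofrating.png')
plt.show()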
